Qwen3-VL-4B-Thinking-SafeGRPO

This repository contains a safety-aligned multimodal reasoning model fine-tuned from Qwen/Qwen3-VL-4B-Thinking using GRPO with the verl reinforcement learning framework.

The model is intended for research on post-training safety alignment of multimodal large language models, particularly image-text understanding, reasoning, and safe response generation.

Model Details

  • Base model: Qwen/Qwen3-VL-4B-Thinking
  • Fine-tuning method: GRPO
  • Training framework: verl
  • Rollout engine: vLLM
  • Model type: Vision-Language Model
  • Training objective: Multimodal safety alignment through reinforcement learning
  • License: Apache-2.0

Training Setup

The model was fine-tuned with Group Relative Policy Optimization (GRPO) as implemented in the verl framework, on a single node with 4 GPUs and vLLM as the default rollout engine.
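
For readers unfamiliar with GRPO, the group-relative advantage in its standard formulation can be sketched as follows. Here G is the number of responses sampled per prompt (8 in this run, set by actor_rollout_ref.rollout.n) and r_i is the scalar reward assigned to the i-th response. The symbols are illustrative and not taken from this repository, and the exact normalization used inside verl (for example, a small constant added to the denominator) may differ from this sketch.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}, \qquad i = 1, \dots, G
```

Because each response is scored relative to the other responses for the same prompt, no separate value (critic) model is needed.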

The launch script with the main training configuration is shown below. Its first positional argument selects the rollout engine (default: vllm).

#!/usr/bin/env bash

set -x

PROJECT_NAME=verl_grpo
EXPERIMENT_NAME=qwen3_vl_4b_thinking_safegrpo

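# The first positional argument selects the rollout engine (default: vllm).
# Shift it off so the argument list forwarded at the end of the command
# holds only extra overrides, not the engine name itself.
ENGINE=${1:-vllm}
[ "$#" -gt 0 ] && shift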

GPU_UTILIZATION=0.6

MODEL_PATH=Qwen/Qwen3-VL-4B-Thinking
TRAIN_FILES=./train_data/safetygrpo_train.parquet
VAL_FILES=./train_data/safetygrpo_test.parquet

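TRAIN_BATCH_SIZE=256                  # prompts per training step
MAX_PROMPT_LENGTH=2048                # tokens; longer prompts are filtered out
MAX_RESPONSE_LENGTH=4096              # tokens
ROLLOUT_N=8                           # responses sampled per prompt (GRPO group size)
PPO_MINI_BATCH_SIZE=64
PPO_MICRO_BATCH_SIZE_PER_GPU=16
LOG_PROB_MICRO_BATCH_SIZE_PER_GPU=16
TENSOR_MODEL_PARALLEL_SIZE=1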

SAVE_FREQ=3000
TEST_FREQ=10
TOTAL_EPOCHS=15

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$TRAIN_FILES \
    data.val_files=$VAL_FILES \
    data.train_batch_size=$TRAIN_BATCH_SIZE \
    data.max_prompt_length=$MAX_PROMPT_LENGTH \
    data.max_response_length=$MAX_RESPONSE_LENGTH \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.image_key=images \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=$PPO_MINI_BATCH_SIZE \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$PPO_MICRO_BATCH_SIZE_PER_GPU \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=$LOG_PROB_MICRO_BATCH_SIZE_PER_GPU \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$TENSOR_MODEL_PARALLEL_SIZE \
    actor_rollout_ref.rollout.name=$ENGINE \
    +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
    actor_rollout_ref.rollout.gpu_memory_utilization=$GPU_UTILIZATION \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=True \
    actor_rollout_ref.rollout.n=$ROLLOUT_N \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=$LOG_PROB_MICRO_BATCH_SIZE_PER_GPU \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    reward_model.reward_manager=batch \
    custom_reward_function.path=./reward/safetygrpo_qwen3.py \
    custom_reward_function.name=compute_score_batch \
    trainer.critic_warmup=0 \
    trainer.logger=wandb \
    trainer.project_name=$PROJECT_NAME \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=$SAVE_FREQ \
    trainer.test_freq=$TEST_FREQ \
    trainer.total_epochs=$TOTAL_EPOCHS \
    trainer.default_local_dir=./checkpoints/$PROJECT_NAME/$EXPERIMENT_NAME "$@"
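
The batch-size settings in the script interact, and the arithmetic can be sketched as below. This is a rough illustration, not part of the original training code. It assumes that verl counts train_batch_size and ppo_mini_batch_size in prompts (so each is multiplied by the rollout count to get sequences) and splits each mini-batch evenly across GPUs. The derived variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Values copied from the launch script above.
TRAIN_BATCH_SIZE=256
ROLLOUT_N=8
PPO_MINI_BATCH_SIZE=64
PPO_MICRO_BATCH_SIZE_PER_GPU=16
MAX_PROMPT_LENGTH=2048
MAX_RESPONSE_LENGTH=4096
N_GPUS=4   # trainer.n_gpus_per_node * trainer.nnodes

# Hypothetical derived quantities (names are illustrative only).
# Assumption: batch sizes are counted in prompts, not sequences.
RESPONSES_PER_STEP=$(( TRAIN_BATCH_SIZE * ROLLOUT_N ))
UPDATES_PER_STEP=$(( TRAIN_BATCH_SIZE / PPO_MINI_BATCH_SIZE ))
SEQS_PER_MINI_BATCH_PER_GPU=$(( PPO_MINI_BATCH_SIZE * ROLLOUT_N / N_GPUS ))
GRAD_ACCUM_STEPS=$(( SEQS_PER_MINI_BATCH_PER_GPU / PPO_MICRO_BATCH_SIZE_PER_GPU ))
MAX_SEQ_TOKENS=$(( MAX_PROMPT_LENGTH + MAX_RESPONSE_LENGTH ))

echo "responses generated per step:      $RESPONSES_PER_STEP"
echo "policy updates per step:           $UPDATES_PER_STEP"
echo "sequences per mini-batch per GPU:  $SEQS_PER_MINI_BATCH_PER_GPU"
echo "gradient accumulation micro-steps: $GRAD_ACCUM_STEPS"
echo "max tokens per sequence:           $MAX_SEQ_TOKENS"
```

Under these assumptions, each training step samples 2048 responses and performs 4 policy updates, with each update accumulating gradients over 8 micro-batches per GPU.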

Model size: 5B params (Safetensors, BF16)