dpo-qwen-cot-merged

This repository provides a multi-stage fine-tuned version of Qwen3-4B-Instruct-2507.

The training pipeline consists of:

  1. Supervised Fine-Tuning (SFT)
  2. Stage-2 Hard SFT refinement
  3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model; this repository contains the final merged weights in BF16 (not quantized).


Training Pipeline

Stage 1 — Supervised Fine-Tuning (SFT)

Base model: Qwen/Qwen3-4B-Instruct-2507
Dataset: u-10bei/structured_data_with_cot_dataset_512_v5

Configuration:

  • Method: QLoRA (4-bit, Unsloth)
  • LoRA: r=64, alpha=128
  • Max sequence length: 512
  • Epochs: 2
  • Learning rate: 1e-4
  • Batch size: 2
  • Gradient accumulation: 8
  • Warmup ratio: 0.05
  • Weight decay: 0.0
  • Seed: 3407
  • CoT masking: Enabled (loss applied only to final outputs)
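
A minimal reproduction sketch of this stage with Unsloth and TRL is shown below. The dataset field name ("text") and the exact trainer argument layout (which varies across TRL versions) are assumptions, and the CoT masking step (setting labels of reasoning tokens to -100 so loss falls only on final outputs) is omitted:

from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the base model in 4-bit for QLoRA training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=512,
    load_in_4bit=True,
)

# Attach LoRA adapters with the Stage 1 settings (r=64, alpha=128)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    random_state=3407,
)

dataset = load_dataset("u-10bei/structured_data_with_cot_dataset_512_v5", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumed field name
    max_seq_length=512,
    args=TrainingArguments(
        num_train_epochs=2,
        learning_rate=1e-4,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        warmup_ratio=0.05,
        weight_decay=0.0,
        seed=3407,
        output_dir="sft-stage1",
    ),
)
trainer.train()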

Stage 2 — Hard Data Refinement

Dataset: daichira/structured-hard-sft-4k

Configuration:

  • Epochs: 1
  • Learning rate: 3e-5
  • Same LoRA configuration as Stage 1

This stage improves robustness on difficult structured transformation tasks.
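
Continuing the Stage 1 sketch, this stage swaps in the hard dataset and lowers the schedule; arguments not listed above are assumed to carry over from Stage 1:

hard_dataset = load_dataset("daichira/structured-hard-sft-4k", split="train")

# Same model and LoRA adapters as Stage 1; only epochs and LR change
stage2_args = TrainingArguments(
    num_train_epochs=1,
    learning_rate=3e-5,
    per_device_train_batch_size=2,   # assumed carried over from Stage 1
    gradient_accumulation_steps=8,   # assumed carried over from Stage 1
    warmup_ratio=0.05,
    weight_decay=0.0,
    seed=3407,
    output_dir="sft-stage2",
)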


Stage 3 — Direct Preference Optimization (DPO)

Dataset: u-10bei/dpo-dataset-qwen-cot

Configuration:

  • Method: DPO via TRL + Unsloth
  • LoRA: r=8, alpha=16
  • Learning rate: 1e-7
  • Beta: 0.1
  • Max sequence length: 1024
  • Max prompt length: 512
  • Epochs: 1
  • Optimizer: adamw_8bit
  • Batch size: 2
  • Gradient accumulation: 4
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.
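
A hedged sketch of this stage with TRL's DPOTrainer follows. The Stage 2 checkpoint path is hypothetical, and some argument names (e.g. processing_class vs. tokenizer) differ across TRL versions:

from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Start from the Stage 2 checkpoint (path is hypothetical)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sft-stage2",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Fresh, smaller LoRA adapters for preference tuning (r=8, alpha=16)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    random_state=42,
)

# Each record provides (prompt, chosen, rejected) text fields
pref_dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    train_dataset=pref_dataset,
    processing_class=tokenizer,
    args=DPOConfig(
        beta=0.1,
        learning_rate=1e-7,
        max_length=1024,
        max_prompt_length=512,
        num_train_epochs=1,
        optim="adamw_8bit",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        weight_decay=0.01,
        seed=42,
        output_dir="dpo-stage3",
    ),
)
trainer.train()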


Merge Status

All LoRA adapters have been merged into the base model.

No PEFT loading is required.
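
For reference, merged weights like these are typically produced with PEFT's merge_and_unload; the sketch below assumes a hypothetical adapter path and is not necessarily the exact script used:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the BF16 base model and apply the trained LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", torch_dtype=torch.bfloat16
)

# Fold the adapter into the base weights, then save a standalone checkpoint
merged = PeftModel.from_pretrained(base, "path/to/dpo-lora-adapter").merge_and_unload()
merged.save_pretrained("dpo-qwen-cot-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507").save_pretrained("dpo-qwen-cot-merged")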


Intended Use

This model is designed for:

  • Structured transformation tasks
  • Chain-of-Thought reasoning
  • Preference-aligned generation
  • Academic research experiments
  • Competition submission

Research Notes

This work explores multi-stage fine-tuning combining:

  • Structured SFT with CoT masking
  • Hard data refinement
  • Preference-based alignment via DPO

The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.

License

This model follows the license of the base model:

Qwen/Qwen3-4B-Instruct-2507

Users must comply with the original base model license.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
# Move inputs to the device the model was dispatched to
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # required for temperature to take effect
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))