# dpo-qwen-cot-merged
This repository provides a multi-stage fine-tuned version of Qwen3-4B-Instruct-2507.
The training pipeline consists of:
- Stage 1: Supervised Fine-Tuning (SFT)
- Stage 2: Hard-data SFT refinement
- Stage 3: Direct Preference Optimization (DPO)
The LoRA adapters have been merged into the base model. This repository contains the final merged full-precision weights.
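For reference, a merge like this is typically performed with PEFT's `merge_and_unload`. The sketch below is purely illustrative (the adapter path is hypothetical) and is not needed to use this repository, since the published weights are already merged:

```python
# Illustrative LoRA merge with PEFT; the adapter path is hypothetical.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "path/to/dpo-lora-adapter")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
model.save_pretrained("dpo-qwen-cot-merged")
```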
## Training Pipeline
### Stage 1 — Supervised Fine-Tuning (SFT)
Base model: Qwen/Qwen3-4B-Instruct-2507
Dataset: u-10bei/structured_data_with_cot_dataset_512_v5
Configuration:
- Method: QLoRA (4-bit, Unsloth)
- LoRA: r=64, alpha=128
- Max sequence length: 512
- Epochs: 2
- Learning rate: 1e-4
- Batch size: 2
- Gradient accumulation: 8
- Warmup ratio: 0.05
- Weight decay: 0.0
- Seed: 3407
- CoT masking: Enabled (loss applied only to final outputs)
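For illustration, these settings map onto a typical Unsloth + TRL run roughly as follows. This is a reconstruction, not the released training script: the dataset split, the `target_modules` choice, and the response-only masking helper are assumptions, and argument names vary across TRL versions.

```python
# Illustrative Stage-1 QLoRA setup (reconstruction, not the exact script).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=512,
    load_in_4bit=True,  # QLoRA: base weights quantized to 4-bit
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # typical choice; not stated in the card
    random_state=3407,
)

dataset = load_dataset("u-10bei/structured_data_with_cot_dataset_512_v5", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        num_train_epochs=2,
        learning_rate=1e-4,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        warmup_ratio=0.05,
        weight_decay=0.0,
        seed=3407,
        output_dir="stage1-sft",
    ),
)
# "CoT masking" means labels outside the final answer are set to -100, so the
# cross-entropy loss is computed only on the final outputs; one common way to
# do this is unsloth.chat_templates.train_on_responses_only(trainer, ...).
trainer.train()
```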
### Stage 2 — Hard Data Refinement
Dataset: daichira/structured-hard-sft-4k
Configuration:
- Epochs: 1
- Learning rate: 3e-5
- Same LoRA configuration as Stage 1
This stage improves robustness on difficult structured transformation tasks.
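One plausible way to chain this stage onto Stage 1 is shown below, continuing from the `model` and `tokenizer` of the previous sketch. Only the dataset, epoch count, and learning rate are stated in the card; the remaining hyperparameters are assumed to carry over from Stage 1.

```python
# Illustrative Stage-2 continuation: same LoRA adapter, new dataset, lower LR.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

hard_dataset = load_dataset("daichira/structured-hard-sft-4k", split="train")

trainer = SFTTrainer(
    model=model,        # Stage-1 PEFT model from the sketch above
    tokenizer=tokenizer,
    train_dataset=hard_dataset,
    args=SFTConfig(
        num_train_epochs=1,
        learning_rate=3e-5,
        per_device_train_batch_size=2,   # assumed, as in Stage 1
        gradient_accumulation_steps=8,   # assumed, as in Stage 1
        warmup_ratio=0.05,               # assumed, as in Stage 1
        seed=3407,
        output_dir="stage2-hard-sft",
    ),
)
trainer.train()
```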
### Stage 3 — Direct Preference Optimization (DPO)
Dataset: u-10bei/dpo-dataset-qwen-cot
Configuration:
- Method: DPO via TRL + Unsloth
- LoRA: r=8, alpha=16
- Learning rate: 1e-7
- Beta: 0.1
- Max sequence length: 1024
- Max prompt length: 512
- Epochs: 1
- Optimizer: adamw_8bit
- Batch size: 2
- Gradient accumulation: 4
- Warmup ratio: 0.1
- Weight decay: 0.01
- Seed: 42
The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.
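A rough sketch of how these settings fit together in TRL is given below. It is illustrative only: the dataset split and adapter wiring are assumptions, and argument names vary across TRL versions.

```python
# Illustrative Stage-3 DPO setup (reconstruction, not the exact script).
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# TRL's DPOTrainer expects "prompt", "chosen", and "rejected" columns, e.g.
# {"prompt": "...", "chosen": "<preferred CoT answer>", "rejected": "<other answer>"}
pref_dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,        # Stage-2 model re-wrapped with a fresh r=8, alpha=16 LoRA
    ref_model=None,     # with a PEFT adapter, the adapter-disabled base acts as the reference
    tokenizer=tokenizer,
    train_dataset=pref_dataset,
    args=DPOConfig(
        beta=0.1,       # strength of the implicit KL penalty toward the reference
        learning_rate=1e-7,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        weight_decay=0.01,
        seed=42,
        max_length=1024,
        max_prompt_length=512,
        optim="adamw_8bit",
        output_dir="stage3-dpo",
    ),
)
trainer.train()
```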
## Merge Status
All LoRA adapters have been merged into the base model.
No PEFT loading is required.
## Intended Use
This model is designed for:
- Structured transformation tasks
- Chain-of-Thought reasoning
- Preference-aligned generation
- Academic research experiments
- Competition submissions
## Research Notes
This work explores multi-stage fine-tuning combining:
- Structured SFT with CoT masking
- Hard data refinement
- Preference-based alignment via DPO
The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.
## License
This model follows the license of the base model:
Qwen/Qwen3-4B-Instruct-2507
Users must comply with the original base model license.
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "HuiyuWang/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
# Move inputs to the model's device; device_map="auto" may place it on a GPU.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # temperature only takes effect when sampling is enabled
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
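Because the base model is instruction-tuned for chat, it is usually better to format the prompt with the tokenizer's chat template rather than passing raw text:

```python
# Same model and tokenizer as above, but with a chat-formatted prompt.
messages = [{"role": "user", "content": "Solve the following problem step by step: ..."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```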