dpo-qwen-cot-merged

This repository provides a multi-stage fine-tuned version of Qwen3-4B-Instruct-2507.

The training pipeline consists of:

  1. Supervised Fine-Tuning (SFT)
  2. Stage-2 Hard SFT refinement
  3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model; this repository contains the final merged weights in BF16 (not quantized).


Training Pipeline

Stage 1 — Supervised Fine-Tuning (SFT)

Base model: Qwen/Qwen3-4B-Instruct-2507
Dataset: u-10bei/structured_data_with_cot_dataset_512_v5

Configuration:

  • Method: QLoRA (4-bit, Unsloth)
  • LoRA: r=64, alpha=128
  • Max sequence length: 512
  • Epochs: 2
  • Learning rate: 1e-4
  • Batch size: 2
  • Gradient accumulation: 8
  • Warmup ratio: 0.05
  • Weight decay: 0.0
  • Seed: 3407
  • CoT masking: Enabled (loss applied only to final outputs)
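
A minimal reproduction sketch of this stage with Unsloth and TRL is shown below. The dataset field name ("text") and the exact trainer argument layout (which varies across TRL versions) are assumptions, and the CoT masking step (setting labels of reasoning tokens to -100 so loss falls only on final outputs) is omitted:

from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the base model in 4-bit for QLoRA training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=512,
    load_in_4bit=True,
)

# Attach LoRA adapters with the Stage 1 settings (r=64, alpha=128)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    random_state=3407,
)

dataset = load_dataset("u-10bei/structured_data_with_cot_dataset_512_v5", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumed field name
    max_seq_length=512,
    args=TrainingArguments(
        num_train_epochs=2,
        learning_rate=1e-4,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        warmup_ratio=0.05,
        weight_decay=0.0,
        seed=3407,
        output_dir="sft-stage1",
    ),
)
trainer.train()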

Stage 2 — Hard Data Refinement

Dataset: daichira/structured-hard-sft-4k

Configuration:

  • Epochs: 1
  • Learning rate: 3e-5
  • Same LoRA configuration as Stage 1

This stage improves robustness on difficult structured transformation tasks.
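
Continuing the Stage 1 sketch, this stage swaps in the hard dataset and lowers the schedule; arguments not listed above are assumed to carry over from Stage 1:

hard_dataset = load_dataset("daichira/structured-hard-sft-4k", split="train")

# Same model and LoRA adapters as Stage 1; only epochs and LR change
stage2_args = TrainingArguments(
    num_train_epochs=1,
    learning_rate=3e-5,
    per_device_train_batch_size=2,   # assumed carried over from Stage 1
    gradient_accumulation_steps=8,   # assumed carried over from Stage 1
    warmup_ratio=0.05,
    weight_decay=0.0,
    seed=3407,
    output_dir="sft-stage2",
)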


Stage 3 — Direct Preference Optimization (DPO)

Dataset: u-10bei/dpo-dataset-qwen-cot

Configuration:

  • Method: DPO via TRL + Unsloth
  • LoRA: r=8, alpha=16
  • Learning rate: 1e-7
  • Beta: 0.1
  • Max sequence length: 1024
  • Max prompt length: 512
  • Epochs: 1
  • Optimizer: adamw_8bit
  • Batch size: 2
  • Gradient accumulation: 4
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.
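
A hedged sketch of this stage with TRL's DPOTrainer follows. The Stage 2 checkpoint path is hypothetical, and some argument names (e.g. processing_class vs. tokenizer) differ across TRL versions:

from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Start from the Stage 2 checkpoint (path is hypothetical)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sft-stage2",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Fresh, smaller LoRA adapters for preference tuning (r=8, alpha=16)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    random_state=42,
)

# Each record provides (prompt, chosen, rejected) text fields
pref_dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    train_dataset=pref_dataset,
    processing_class=tokenizer,
    args=DPOConfig(
        beta=0.1,
        learning_rate=1e-7,
        max_length=1024,
        max_prompt_length=512,
        num_train_epochs=1,
        optim="adamw_8bit",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        weight_decay=0.01,
        seed=42,
        output_dir="dpo-stage3",
    ),
)
trainer.train()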


Merge Status

All LoRA adapters have been merged into the base model.

No PEFT loading is required.
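
For reference, merged weights like these are typically produced with PEFT's merge_and_unload; the sketch below assumes a hypothetical adapter path and is not necessarily the exact script used:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the BF16 base model and apply the trained LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", torch_dtype=torch.bfloat16
)

# Fold the adapter into the base weights, then save a standalone checkpoint
merged = PeftModel.from_pretrained(base, "path/to/dpo-lora-adapter").merge_and_unload()
merged.save_pretrained("dpo-qwen-cot-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507").save_pretrained("dpo-qwen-cot-merged")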


Intended Use

This model is designed for:

  • Structured transformation tasks
  • Chain-of-Thought reasoning
  • Preference-aligned generation
  • Academic research experiments
  • Competition submission

Research Notes

This work explores multi-stage fine-tuning combining:

  • Structured SFT with CoT masking
  • Hard data refinement
  • Preference-based alignment via DPO

The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.

License

This model follows the license of the base model:

Qwen/Qwen3-4B-Instruct-2507

Users must comply with the original base model license.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
# Move inputs to the device the model was dispatched to
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # required for temperature to take effect
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))