GPT-2 Fine-tuned with Direct Preference Optimization (DPO)

A GPT-2 model fine-tuned with Direct Preference Optimization (DPO) on a subset of the UltraFeedback dataset. The model itself starts from the pretrained GPT-2 weights; what was written from scratch is the training pipeline (data preparation, log-probability computation, DPO loss, and training loop), which uses no high-level training wrappers.

Model Description

  • Model type: Causal Language Model (GPT-2) fine-tuned with DPO
  • Language: English
  • License: MIT
  • Finetuned from: openai-community/gpt2

What is DPO?

Direct Preference Optimization is an alignment technique that teaches a language model to prefer certain responses over others using pairwise preference data. Standard RLHF first trains a separate reward model and then optimizes the policy with reinforcement learning; DPO replaces both stages with a single supervised loss on each preference pair:

L_DPO = -log σ( β * [(log π_θ(yw|x) - log π_θ(yl|x)) - (log π_ref(yw|x) - log π_ref(yl|x))] )

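where x is the prompt, yw is the chosen response, yl is the rejected response, π_θ is the policy being trained, π_ref is a frozen reference model, σ is the logistic sigmoid, and β controls how strongly the policy is kept close to the reference.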
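The loss above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code used to train this model: the function name `dpo_loss`, its argument names, and the choice to average over the batch are assumptions, and the inputs are assumed to be sequence-level log-probabilities already computed for each response.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from sequence-level log-probabilities, each of shape [batch]."""
    # Chosen-minus-rejected margins under the policy and the frozen reference.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(z)), computed stably; averaged over the batch (assumption).
    loss = -F.logsigmoid(logits).mean()
    # Implicit rewards for logging only, detached from the autograd graph.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return loss, chosen_rewards, rejected_rewards
```

When the policy and the reference agree exactly, the argument of σ is zero and the loss equals log 2, which is a convenient sanity check at the start of training.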

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id as listed on the Hub page for this model.
repo_id = "cbyabush/gpt2-dpo-ultrafeedback"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

prompt = "User: What is the capital of France?\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding; GPT-2 defines no pad token, so EOS is reused for padding.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

A subset of UltraFeedback, a large-scale preference dataset of (prompt, chosen, rejected) triples covering diverse instruction-following tasks.

Training Procedure

The policy model starts from GPT-2 weights. A frozen copy serves as the reference model throughout training. The DPO loss encourages the policy to assign higher relative probability to chosen responses compared to rejected ones, anchored by the reference.
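One way to set up the policy and its frozen reference is sketched below. It uses a small `nn.Linear` as a stand-in for GPT-2 so the snippet runs without downloading weights; the variable names are illustrative, and the actual training code may create the reference differently (for example by loading the checkpoint a second time).

```python
import copy

import torch
from torch import nn

# Stand-in for the GPT-2 policy; any nn.Module behaves the same way here.
policy = nn.Linear(8, 8)

# Deep-copy the initial weights, then freeze the copy.
reference = copy.deepcopy(policy)
reference.eval()
for param in reference.parameters():
    param.requires_grad_(False)

# Only the policy's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

x = torch.randn(4, 8)
with torch.no_grad():  # reference forward passes need no autograd graph
    ref_out = reference(x)
policy_out = policy(x)
```

Because the reference never receives gradients, its log-probabilities can also be computed once and cached if memory is tight.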

Training Hyperparameters

Parameter                      Value
Base model                     gpt2
β (beta)                       0.1
Max prompt tokens              96
Max response tokens            64
Max total length               160
Batch size                     4
Gradient accumulation steps    4
Effective batch size           16
Learning rate                  1e-5
Optimizer                      AdamW
Gradient clipping              1.0
Epochs                         1
Seed                           13
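The batch size, accumulation steps, and clipping value listed above interact as shown in the sketch below: four micro-batches of 4 examples are accumulated before each optimizer step, giving the effective batch size of 16. The toy linear model, the random data, and the MSE loss are placeholders standing in for the policy and the DPO loss; the constant names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(13)

BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4
LEARNING_RATE = 1e-5
MAX_GRAD_NORM = 1.0
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRAD_ACCUM_STEPS  # 4 * 4 = 16

model = nn.Linear(8, 1)  # toy stand-in for the policy
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

# Eight hypothetical micro-batches of random data.
micro_batches = [
    (torch.randn(BATCH_SIZE, 8), torch.randn(BATCH_SIZE, 1)) for _ in range(8)
]

optimizer_steps = 0
optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = nn.functional.mse_loss(model(x), y)  # placeholder for the DPO loss
    # Scale so the accumulated gradient is the mean over the effective batch.
    (loss / GRAD_ACCUM_STEPS).backward()
    if step % GRAD_ACCUM_STEPS == 0:
        # Clip the global gradient norm before updating the weights.
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
```

With eight micro-batches and four accumulation steps, the loop performs two optimizer updates.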

Implementation

Every component was implemented from scratch in pure PyTorch:

  • Causal LM sequence building with prompt masking
  • Preference example tokenization with configurable length budgets
  • Sequence-level log-probability computation with causal shift
  • DPO loss with implicit reward tracking
  • Training loop with gradient accumulation and clipping
  • Evaluation loop with preference accuracy metric
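As an illustration of the log-probability component, the sketch below computes a sequence-level log-probability with the causal shift and a mask that excludes prompt tokens. The function name, the argument layout, and the choice to sum (rather than average) token log-probabilities are assumptions, not a description of the exact code behind this model.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(logits, input_ids, response_mask):
    """Sum of log-probabilities of the response tokens in each sequence.

    logits:        [batch, seq_len, vocab] model outputs
    input_ids:     [batch, seq_len] prompt + response token ids
    response_mask: [batch, seq_len] 1 for response tokens, 0 for prompt/padding
    """
    # Causal shift: logits at position t predict the token at position t + 1.
    shifted_logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]
    mask = response_mask[:, 1:].to(shifted_logits.dtype)

    log_probs = F.log_softmax(shifted_logits, dim=-1)
    # Pick out the log-probability assigned to each actual next token.
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Zero out prompt and padding positions, then sum over the sequence.
    return (token_logps * mask).sum(dim=-1)
```

Calling this once for the chosen sequence and once for the rejected sequence, under both the policy and the reference, yields the four quantities the DPO loss needs.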

Evaluation

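Preference accuracy is the fraction of evaluation pairs for which the policy's log-probability margin between the chosen and rejected response exceeds the reference model's margin. Equivalently, it is how often the implicit DPO reward of the chosen response is higher than that of the rejected one: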

accuracy = mean[ (log π_θ(yw|x) - log π_θ(yl|x)) > (log π_ref(yw|x) - log π_ref(yl|x)) ]
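The metric can be computed directly from the same four sequence-level log-probabilities used by the loss. The function below is a minimal sketch with illustrative names, and the numbers in the example are made up purely to show the calculation.

```python
import torch


def preference_accuracy(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps):
    """Fraction of pairs where the policy margin beats the reference margin."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return (policy_margin > ref_margin).float().mean().item()


# Made-up log-probabilities for three preference pairs.
acc = preference_accuracy(
    torch.tensor([-10.0, -12.0, -8.0]),   # policy, chosen
    torch.tensor([-11.0, -11.0, -9.0]),   # policy, rejected
    torch.tensor([-10.5, -11.0, -8.0]),   # reference, chosen
    torch.tensor([-10.5, -12.0, -9.0]),   # reference, rejected
)
```

In this example only the first pair counts as correct: the second has a smaller policy margin than the reference, and the third is a tie, which the strict inequality does not count.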

Citation

@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}

@article{cui2023ultrafeedback,
  title={UltraFeedback: Boosting Language Models with High-quality Feedback},
  author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2310.01377},
  year={2023}
}