GPT-2 Fine-tuned with Direct Preference Optimization (DPO)
A GPT-2 model fine-tuned with Direct Preference Optimization (DPO) on a subset of the UltraFeedback dataset. The model starts from the pretrained GPT-2 weights; it is the training pipeline (data preparation, log-probability computation, DPO loss, and training loop) that was implemented from scratch, without any high-level training wrappers.
Model Description
- Model type: Causal Language Model (GPT-2) fine-tuned with DPO
- Language: English
- License: MIT
- Finetuned from: openai-community/gpt2
What is DPO?
Direct Preference Optimization (DPO) is an alignment technique that trains a language model to prefer some responses over others using pairwise preference data. Standard RLHF fits a separate reward model and then optimizes the policy with reinforcement learning; DPO replaces both stages with a single supervised loss on preference pairs:
L_DPO = -log σ( β * [(log π_θ(yw|x) - log π_θ(yl|x)) - (log π_ref(yw|x) - log π_ref(yl|x))] )
where yw is the chosen response, yl is the rejected response, π_θ is the policy being trained, π_ref is a frozen reference model, σ is the logistic sigmoid, and β controls how strongly the policy is kept close to the reference.
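The loss above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the exact code used to train this model: the function name `dpo_loss` and its argument names are assumptions, and the inputs are taken to be sequence-level log-probabilities of shape `[batch]`.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Mean DPO loss over a batch of sequence-level log-probabilities.

    All four inputs are 1-D tensors of shape [batch]. The name and
    signature are hypothetical, chosen for this sketch only.
    """
    # Chosen-vs-rejected margin under the policy and under the reference.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps

    # -log sigmoid(beta * (policy_margin - ref_margin)), computed stably.
    loss = -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

    # Implicit rewards, detached because they are only used for logging.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return loss, chosen_reward, rejected_reward
```

When the policy equals the reference, the argument of the sigmoid is zero and the loss is log 2 (about 0.693), which is a useful sanity check at the start of training.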
How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cbyabush/gpt2-dpo-ultrafeedback"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "User: What is the capital of France?\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding; GPT-2 has no pad token, so reuse EOS to avoid a warning.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
Training Data
A subset of UltraFeedback, a large-scale preference dataset of (prompt, chosen, rejected) triples covering diverse instruction-following tasks.
Training Procedure
The policy model starts from GPT-2 weights. A frozen copy serves as the reference model throughout training. The DPO loss encourages the policy to assign higher relative probability to chosen responses compared to rejected ones, anchored by the reference.
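One way to obtain the frozen reference is to deep-copy the policy before training and disable gradients on the copy. The sketch below uses a small `nn.Linear` as a stand-in for GPT-2 so that it runs without downloading weights; the helper name `make_frozen_reference` is an assumption, not a name from the original code.

```python
import copy

import torch
from torch import nn

def make_frozen_reference(policy: nn.Module) -> nn.Module:
    """Return an independent, non-trainable copy of `policy` (hypothetical helper)."""
    reference = copy.deepcopy(policy)
    reference.eval()  # disable dropout so reference log-probs are deterministic
    for param in reference.parameters():
        param.requires_grad_(False)
    return reference

# Stand-in for the GPT-2 policy; any nn.Module works the same way.
policy = nn.Linear(4, 2)
reference = make_frozen_reference(policy)
```

Because the copy shares no storage with the policy, later optimizer steps on the policy leave the reference unchanged.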
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | gpt2 |
| β (beta) | 0.1 |
| Max prompt tokens | 96 |
| Max response tokens | 64 |
| Max total length | 160 |
| Batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Learning rate | 1e-5 |
| Optimizer | AdamW |
| Gradient clipping | 1.0 |
| Epochs | 1 |
| Seed | 13 |
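The batch size, accumulation, clipping, and optimizer settings in the table fit together roughly as follows. This is a sketch under stated assumptions: the linear model and `toy_loss` are placeholders for the GPT-2 policy and the DPO loss, and the loop structure is one common way to do gradient accumulation, not necessarily the one used here.

```python
import torch
from torch import nn

# Values taken from the hyperparameter table.
BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4
LEARNING_RATE = 1e-5
MAX_GRAD_NORM = 1.0
SEED = 13

torch.manual_seed(SEED)
model = nn.Linear(8, 1)  # placeholder for the GPT-2 policy
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

def toy_loss(batch):
    # Placeholder for the DPO loss on one micro-batch.
    return model(batch).pow(2).mean()

micro_batches = [torch.randn(BATCH_SIZE, 8) for _ in range(8)]
optimizer_steps = 0

optimizer.zero_grad()
for i, batch in enumerate(micro_batches, start=1):
    # Scale so the accumulated gradient is the mean over the effective batch.
    loss = toy_loss(batch) / GRAD_ACCUM_STEPS
    loss.backward()
    if i % GRAD_ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS  # 4 * 4 = 16
```

With eight micro-batches of four examples each, the optimizer takes two steps, each based on 16 examples.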
Implementation
Every component was implemented from scratch in pure PyTorch:
- Causal LM sequence building with prompt masking
- Preference example tokenization with configurable length budgets
- Sequence-level log-probability computation with causal shift
- DPO loss with implicit reward tracking
- Training loop with gradient accumulation and clipping
- Evaluation loop with preference accuracy metric
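The first three components can be illustrated together. The sketch below assumes token IDs are already available as Python lists, truncates the prompt from the left and the response from the right (the truncation sides are an assumption), and sums per-token log-probabilities over response positions only. Function names are hypothetical, and padding for batches of unequal length is not shown; padded positions would simply get a mask value of 0.

```python
import torch
import torch.nn.functional as F

def build_sequence(prompt_ids, response_ids, max_prompt=96, max_response=64):
    """Concatenate prompt and response; the mask is 1 on response tokens only."""
    prompt_ids = prompt_ids[-max_prompt:]        # keep the end of the prompt (assumption)
    response_ids = response_ids[:max_response]   # keep the start of the response (assumption)
    input_ids = torch.tensor(prompt_ids + response_ids)
    response_mask = torch.tensor([0] * len(prompt_ids) + [1] * len(response_ids))
    return input_ids, response_mask

def sequence_logprob(logits, input_ids, response_mask):
    """Sum of log p(token_t | tokens_<t) over response tokens.

    logits: [B, T, V]; input_ids and response_mask: [B, T].
    """
    # Causal shift: the logits at position t predict the token at t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = response_mask[:, 1:].to(logits.dtype)

    log_probs = F.log_softmax(shift_logits, dim=-1)
    token_logps = log_probs.gather(2, shift_labels.unsqueeze(-1)).squeeze(-1)

    # Prompt positions are zeroed out, so only the response contributes.
    return (token_logps * shift_mask).sum(dim=-1)
```

With the default budgets, a sequence is at most 96 + 64 = 160 tokens long, matching the maximum total length in the hyperparameter table.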
Evaluation
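Preference accuracy is the fraction of evaluation pairs for which the policy's log-probability margin between the chosen and rejected response exceeds the reference model's margin. Equivalently, it is how often the implicit DPO reward of the chosen response is higher than that of the rejected response: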
accuracy = mean[ (log π_θ(yw|x) - log π_θ(yl|x)) > (log π_ref(yw|x) - log π_ref(yl|x)) ]
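As a sketch, the metric can be computed directly from the four sequence-level log-probability tensors. The function name `preference_accuracy` is an assumption for illustration.

```python
import torch

def preference_accuracy(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps):
    """Fraction of pairs where the policy margin exceeds the reference margin."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return (policy_margin > ref_margin).float().mean().item()
```

Note that the comparison is strict, so a policy identical to the reference scores 0 under this definition rather than 0.5.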
Citation
@article{rafailov2023direct,
title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
journal={arXiv preprint arXiv:2305.18290},
year={2023}
}
@article{cui2023ultrafeedback,
title={UltraFeedback: Boosting Language Models with High-quality Feedback},
author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2310.01377},
year={2023}
}