GPT-2 Fine-tuned with Direct Preference Optimization (DPO)

A GPT-2 model fine-tuned with Direct Preference Optimization (DPO) on a subset of the UltraFeedback dataset. The entire training pipeline (data preparation, log-probability computation, DPO loss, and training loop) was implemented from scratch, without any high-level training wrappers.

Model Description

Model type: Causal Language Model (GPT-2) fine-tuned with DPO
Language: English
License: MIT
Finetuned from: openai-community/gpt2

What is DPO?

Direct Preference Optimization (DPO) is an alignment technique that teaches a language model to prefer certain responses over others using pairwise preference data. Standard RLHF trains a separate reward model and then optimizes the policy with reinforcement learning. DPO folds both steps into a single supervised loss:

L_DPO = -log σ( β * [(log π_θ(yw|x) - log π_θ(yl|x)) - (log π_ref(yw|x) - log π_ref(yl|x))] )

where yw is the chosen response, yl is the rejected response, π_θ is the policy being trained, and π_ref is a frozen reference model.
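
To make the formula concrete, here is a minimal sketch of the loss for a single preference pair in plain Python. It is not the training code behind this model, which operates on batched PyTorch tensors. The function name `dpo_loss` and its argument names are illustrative, and the inputs are assumed to be sequence-level log-probabilities already summed over the response tokens.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more strongly the policy prefers the chosen response
    # than the frozen reference does.
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); branch to stay numerically stable.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

The loss drops below log(2) once the policy widens the chosen-versus-rejected gap beyond the reference's gap, and rises above it when the gap narrows.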

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub repository that hosts this checkpoint
repo_id = "cbyabush/gpt2-dpo-ultrafeedback"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

prompt = "User: What is the capital of France?\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding; GPT-2 defines no pad token, so reuse EOS for padding
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

A subset of UltraFeedback, a large-scale preference dataset of (prompt, chosen, rejected) triples covering diverse instruction-following tasks.

Training Procedure

The policy model starts from GPT-2 weights. A frozen copy serves as the reference model throughout training. The DPO loss encourages the policy to assign higher relative probability to chosen responses compared to rejected ones, anchored by the reference.

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Base model | gpt2 |
| β (beta) | 0.1 |
| Max prompt tokens | 96 |
| Max response tokens | 64 |
| Max total length | 160 |
| Batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Learning rate | 1e-5 |
| Optimizer | AdamW |
| Gradient clipping | 1.0 |
| Epochs | 1 |
| Seed | 13 |
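
Two of the values above are derived from the others: the effective batch size is the batch size times the number of gradient accumulation steps, and the maximum total length is the sum of the prompt and response budgets. The sketch below encodes that relationship. `DPOConfig` and its field names are hypothetical and are not taken from the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPOConfig:
    """Hypothetical container for the hyperparameters listed above."""

    base_model: str = "gpt2"
    beta: float = 0.1
    max_prompt_tokens: int = 96
    max_response_tokens: int = 64
    batch_size: int = 4
    grad_accum_steps: int = 4
    learning_rate: float = 1e-5
    max_grad_norm: float = 1.0
    epochs: int = 1
    seed: int = 13

    @property
    def max_total_length(self) -> int:
        # Prompt and response budgets add up to the full sequence budget.
        return self.max_prompt_tokens + self.max_response_tokens

    @property
    def effective_batch_size(self) -> int:
        # Number of examples contributing gradients to each optimizer step.
        return self.batch_size * self.grad_accum_steps


config = DPOConfig()
print(config.max_total_length, config.effective_batch_size)  # 160 16
```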

Implementation

Every component was implemented from scratch in pure PyTorch:

Causal LM sequence building with prompt masking
Preference example tokenization with configurable length budgets
Sequence-level log-probability computation with causal shift
DPO loss with implicit reward tracking
Training loop with gradient accumulation and clipping
Evaluation loop with preference accuracy metric
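
As an illustration of the first three components, the following plain-Python sketch builds a prompt-plus-response sequence with a response mask and sums token log-probabilities using the causal shift. It is a simplified stand-in for the PyTorch implementation, not a copy of it. The function names are illustrative, and the truncation policy (keep the end of the prompt and the start of the response) is an assumption, since the card does not state which side is truncated.

```python
import math


def build_sequence(prompt_ids, response_ids, max_prompt=96, max_response=64):
    """Concatenate prompt and response ids and mark which tokens are scored."""
    # Assumed policy: keep the last max_prompt prompt tokens and the
    # first max_response response tokens.
    prompt_ids = prompt_ids[max(0, len(prompt_ids) - max_prompt):]
    response_ids = response_ids[:max_response]
    input_ids = prompt_ids + response_ids
    # 0 masks prompt tokens out of the log-probability, 1 keeps response tokens.
    response_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, response_mask


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_logprob(logits, input_ids, response_mask):
    """Sum log p(token t | tokens before t) over response tokens only."""
    total = 0.0
    # Causal shift: the logits at position t - 1 predict the token at position t.
    for t in range(1, len(input_ids)):
        if response_mask[t]:
            total += log_softmax(logits[t - 1])[input_ids[t]]
    return total


# Toy example: vocabulary of 3 tokens, uniform logits at every position.
ids, mask = build_sequence([0, 1], [2, 1])
uniform_logits = [[0.0, 0.0, 0.0] for _ in ids]
print(sequence_logprob(uniform_logits, ids, mask))  # 2 * log(1/3)
```

Running `sequence_logprob` once for the chosen response and once for the rejected response, under both the policy and the reference model, yields the four log-probabilities that the DPO loss consumes.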

Evaluation

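Preference accuracy is the fraction of evaluation pairs for which the policy's log-probability margin between the chosen and rejected responses exceeds the reference model's margin. Equivalently, it counts how often the implicit DPO reward of the chosen response is higher than that of the rejected one: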

accuracy = mean[ (log π_θ(yw|x) - log π_θ(yl|x)) > (log π_ref(yw|x) - log π_ref(yl|x)) ]
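
A minimal sketch of this metric in plain Python follows. The function name and the input layout (one `(chosen_logp, rejected_logp)` tuple per pair) are assumptions made for illustration, and the numbers are made up rather than measured results.

```python
def preference_accuracy(policy_logps, ref_logps):
    """Fraction of pairs where the policy margin exceeds the reference margin.

    Each argument is a list of (chosen_logp, rejected_logp) tuples, one per pair.
    """
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("expected two non-empty lists of equal length")
    hits = 0
    for (pc, pr), (rc, rr) in zip(policy_logps, ref_logps):
        # Compare chosen-minus-rejected margins under policy and reference.
        if (pc - pr) > (rc - rr):
            hits += 1
    return hits / len(policy_logps)


# Illustrative values only: the first and third pairs count as correct.
policy = [(-9.0, -13.0), (-11.0, -10.5), (-8.0, -8.0)]
reference = [(-10.0, -12.0), (-10.0, -12.0), (-9.0, -8.5)]
print(preference_accuracy(policy, reference))
```

Because the comparison is strict, a policy identical to the reference scores 0 on this metric, so any value above 0 reflects pairs where training moved the margin in the preferred direction.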

Citation

@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}

@article{cui2023ultrafeedback,
  title={UltraFeedback: Boosting Language Models with High-quality Feedback},
  author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2310.01377},
  year={2023}
}