Qwen2.5-0.5B Truthy DPO

This model is a DPO fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct, trained on the jondurbin/truthy-dpo-v0.1 dataset for an NLP course assignment (Assignment 5) on human preference optimization and LLM-as-a-Judge evaluation.
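A minimal inference sketch with transformers is shown below. The repository id is a placeholder, since this card does not state where the fine-tuned weights are published; substitute the actual model path.

```python
# Minimal inference sketch using transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/qwen2.5-0.5b-truthy-dpo"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Qwen2.5 instruct models expect their chat template.
messages = [{"role": "user", "content": "Do humans only use 10% of their brains?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```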

Base model

  • Qwen/Qwen2.5-0.5B-Instruct

Training dataset

  • jondurbin/truthy-dpo-v0.1

Training method

  • Direct Preference Optimization (DPO)
  • LoRA fine-tuning
  • 4-bit quantization
  • trained on a Google Colab T4 GPU (see the setup sketch after this list)
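The sketch below shows one plausible way to assemble this setup with transformers, peft, and bitsandbytes. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions; the card does not specify them.

```python
# Sketch: load the base model in 4-bit and attach LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs do not support bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                          # assumed rank, not stated in the card
    lora_alpha=32,                 # assumed scaling, not stated in the card
    lora_dropout=0.05,             # assumed dropout, not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```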

Key hyperparameters

  • epochs: 1
  • learning rate: 1e-5
  • per-device batch size: 1
  • gradient accumulation steps: 8
  • max length: 384
  • beta: 0.1 (controls deviation from the reference model; see the training sketch below)
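A minimal training sketch wiring these hyperparameters into TRL's DPOTrainer, reusing model and tokenizer from the setup sketch above. Argument names such as processing_class and eval_strategy vary across TRL and transformers releases, and the 95/5 eval split and eval cadence are assumptions chosen to match the reported eval steps.

```python
# Sketch: DPO training with TRL, using the hyperparameters listed above.
# Assumes `model` and `tokenizer` from the previous sketch.
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# The dataset provides the "prompt"/"chosen"/"rejected" columns DPOTrainer expects.
dataset = load_dataset("jondurbin/truthy-dpo-v0.1", split="train")
splits = dataset.train_test_split(test_size=0.05, seed=42)  # assumed eval split

config = DPOConfig(
    output_dir="qwen2.5-0.5b-truthy-dpo",
    num_train_epochs=1,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    max_length=384,
    beta=0.1,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=40,                   # matches the eval losses reported below
)

trainer = DPOTrainer(
    model=model,                     # already a PEFT model; TRL derives the
    args=config,                     # reference model by disabling the adapters
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    processing_class=tokenizer,
)
trainer.train()
```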

Training results

  • final training loss: 0.5941
  • eval loss at step 40: 0.6167
  • eval loss at step 80: 0.5661

For context, the DPO loss starts at ln 2 ≈ 0.693 when the policy scores chosen and rejected responses equally, so losses in the 0.57-0.62 range indicate the model learned a modest but real preference margin.

Intended use

This model was trained for educational purposes as part of an NLP assignment on preference optimization and factuality alignment.

Limitations

This model may still hallucinate or produce incorrect outputs. It has only been trained on a relatively small preference dataset and should not be used in high-stakes settings.
