Qwen2.5-0.5B Truthy DPO
This model is a DPO fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct, trained on the jondurbin/truthy-dpo-v0.1 dataset for NLP-Assignment5, covering human preference optimization and LLM-as-a-Judge evaluation.
Base model
- Qwen/Qwen2.5-0.5B-Instruct
Training dataset
- jondurbin/truthy-dpo-v0.1
Training method
- Direct Preference Optimization (DPO)
- LoRA fine-tuning
- 4-bit quantization
- Google Colab T4 GPU
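The exact training script is not included in this card, but under the assumption that the run used Hugging Face's trl, peft, and bitsandbytes libraries (a common recipe for DPO + LoRA + 4-bit training on a Colab T4), the method above might be configured as follows. The LoRA rank, alpha, and target modules are illustrative assumptions, not values taken from this card:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig

# 4-bit NF4 quantization so the base model fits comfortably on a T4 GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapter config -- rank, alpha, and target modules are assumptions
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# DPO training arguments matching the hyperparameters listed in this card
training_args = DPOConfig(
    num_train_epochs=1,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    max_length=384,
    beta=0.1,
)
```

These configs would then be passed to trl's DPOTrainer together with the quantized base model, its tokenizer, and the jondurbin/truthy-dpo-v0.1 train split.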
Key hyperparameters
- epochs: 1
- learning rate: 1e-5
- per-device batch size: 1
- gradient accumulation steps: 8
- max length: 384
- beta: 0.1
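As a sanity check on the loss values reported below: the per-example DPO loss is -log σ(β · margin), where the margin compares how much more the policy prefers the chosen response over the rejected one relative to the reference model. With policy and reference still identical, the margin is 0 and the loss is ln 2 ≈ 0.693, so values in the 0.56-0.62 range indicate learning. A minimal sketch with β = 0.1 (the log-probabilities are made-up illustrative numbers):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin)."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference, so margin = 0:
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # → 0.6931... (ln 2)

# Once the policy favors the chosen response more than the reference does,
# the margin is positive and the loss drops below ln 2:
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```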
Training results
- final training loss: 0.5941
- eval loss at step 40: 0.6167
- eval loss at step 80: 0.5661
Intended use
This model was trained for educational purposes as part of an NLP assignment on preference optimization and factuality alignment.
Limitations
This model may still hallucinate or produce incorrect outputs. It has only been trained on a relatively small preference dataset and should not be used in high-stakes settings.