Qwen2.5-0.5B Truthy DPO

This model is a DPO fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct, trained on the jondurbin/truthy-dpo-v0.1 dataset for an NLP course assignment (Assignment 5) on human preference optimization and LLM-as-a-Judge evaluation.
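A minimal inference sketch with transformers is shown below. The repository id is a placeholder, since this card does not state where the fine-tuned weights are published; substitute the actual model path.

```python
# Minimal inference sketch using transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/qwen2.5-0.5b-truthy-dpo"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Qwen2.5 instruct models expect their chat template.
messages = [{"role": "user", "content": "Do humans only use 10% of their brains?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```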

Base model

  • Qwen/Qwen2.5-0.5B-Instruct

Training dataset

  • jondurbin/truthy-dpo-v0.1

Training method

  • Direct Preference Optimization (DPO)
  • LoRA fine-tuning
  • 4-bit quantization
  • trained on a Google Colab T4 GPU (see the setup sketch after this list)
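The sketch below shows one plausible way to assemble this setup with transformers, peft, and bitsandbytes. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions; the card does not specify them.

```python
# Sketch: load the base model in 4-bit and attach LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs do not support bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                          # assumed rank, not stated in the card
    lora_alpha=32,                 # assumed scaling, not stated in the card
    lora_dropout=0.05,             # assumed dropout, not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```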

Key hyperparameters

  • epochs: 1
  • learning rate: 1e-5
  • per-device batch size: 1
  • gradient accumulation steps: 8
  • max length: 384
  • beta: 0.1 (controls deviation from the reference model; see the training sketch below)
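A minimal training sketch wiring these hyperparameters into TRL's DPOTrainer, reusing model and tokenizer from the setup sketch above. Argument names such as processing_class and eval_strategy vary across TRL and transformers releases, and the 95/5 eval split and eval cadence are assumptions chosen to match the reported eval steps.

```python
# Sketch: DPO training with TRL, using the hyperparameters listed above.
# Assumes `model` and `tokenizer` from the previous sketch.
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# The dataset provides the "prompt"/"chosen"/"rejected" columns DPOTrainer expects.
dataset = load_dataset("jondurbin/truthy-dpo-v0.1", split="train")
splits = dataset.train_test_split(test_size=0.05, seed=42)  # assumed eval split

config = DPOConfig(
    output_dir="qwen2.5-0.5b-truthy-dpo",
    num_train_epochs=1,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    max_length=384,
    beta=0.1,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=40,                   # matches the eval losses reported below
)

trainer = DPOTrainer(
    model=model,                     # already a PEFT model; TRL derives the
    args=config,                     # reference model by disabling the adapters
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    processing_class=tokenizer,
)
trainer.train()
```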

Training results

  • final training loss: 0.5941
  • eval loss at step 40: 0.6167
  • eval loss at step 80: 0.5661

For context, the DPO loss starts at ln 2 ≈ 0.693 when the policy scores chosen and rejected responses equally, so losses in the 0.57-0.62 range indicate the model learned a modest but real preference margin.

Intended use

This model was trained for educational purposes as part of an NLP assignment on preference optimization and factuality alignment.

Limitations

This model may still hallucinate or produce incorrect outputs. It has only been trained on a relatively small preference dataset and should not be used in high-stakes settings.
