medi-LLaMA
TinyLlama-1.1B fine-tuned on medical question answering using Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) on the MedCRAFT dataset.
Model Details
Model Description
- Developed by: Muhammad Hamza Moosani, Hadiya Muneeb, Maryam Sultan
- Model type: Causal Language Model (decoder-only)
- Language: English
- License: MIT
- Base model: TinyLlama/TinyLlama_v1.1 (1.1B parameters)
- Fine-tuning method: LoRA SFT → LoRA DPO (fully merged)
- Training data: MedCRAFT SFT + DPO splits
Model Sources
- Repository: LeonaKV18/Medi-LLaMA
- Demo: medi-llama-demo on HuggingFace Spaces
Uses
Direct Use
This model is intended for medical question answering in an educational or research context. It can answer clinical questions about symptoms, drug mechanisms, diagnostic criteria, and treatment protocols.
Out-of-Scope Use
This model should not be used as a substitute for professional medical advice, diagnosis, or treatment. It is a research prototype trained on a small dataset and will produce factual errors. Do not use for clinical decision-making.
How to Get Started
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the merged model and its tokenizer (CPU, full precision)
model = AutoModelForCausalLM.from_pretrained(
    "MuhammadHamza33/medi-llama-dpo",
    torch_dtype=torch.float32,
    device_map='cpu'
)
tokenizer = AutoTokenizer.from_pretrained("MuhammadHamza33/medi-llama-dpo")

SYSTEM = (
    "You are an experienced and knowledgeable medical professional. "
    "Provide clear, factual, and helpful medical information."
)
question = "What are the early warning signs of Type 2 diabetes?"

# Llama-2-style instruction prompt with a system block
prompt = f"<s>[INST] <<SYS>>\n{SYSTEM}\n<</SYS>>\n\n{question} [/INST]"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding, no gradient tracking
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
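When asking several questions, the prompt layout used above can be wrapped in a small helper. This is only a convenience sketch: `build_prompt` is a hypothetical name, not part of the model repository, and it reproduces the single-turn format from the snippet above.

```python
def build_prompt(question: str, system: str) -> str:
    """Format one question in the [INST] / <<SYS>> layout shown in the card.

    Hypothetical helper for illustration; it only reproduces the string
    template from the getting-started snippet.
    """
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{question} [/INST]"


SYSTEM = (
    "You are an experienced and knowledgeable medical professional. "
    "Provide clear, factual, and helpful medical information."
)

prompt = build_prompt("What are the early warning signs of Type 2 diabetes?", SYSTEM)
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors="pt")` exactly as in the snippet above.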
Training Details
Training Data
SFT: sherry0213/MedCRAFT SFT split — 5,400 instruction-response pairs (90% of the first 6,000 rows after quality filtering). Each row has an instruction, response, and diff (difficulty) column.
DPO: sherry0213/MedCRAFT DPO split — 1,951 preference triples (prompt, chosen, rejected) after removing 4 identical chosen/rejected pairs.
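The row counts above can be illustrated with a minimal sketch on synthetic data. The function names are hypothetical, and two details are assumptions not stated in the card: that the remaining 10% of the 6,000 rows served as a validation split, and that "identical" means exact string equality between `chosen` and `rejected`.

```python
def split_sft(rows, limit=6000, train_frac=0.9):
    """Keep the first `limit` rows and split them into train / held-out parts.

    Assumption: the held-out 10% was used for validation.
    """
    subset = rows[:limit]
    cut = round(len(subset) * train_frac)
    return subset[:cut], subset[cut:]


def drop_identical_pairs(triples):
    """Remove preference triples whose chosen and rejected texts are equal.

    Assumption: exact string equality is the criterion.
    """
    return [t for t in triples if t["chosen"] != t["rejected"]]


# Synthetic stand-ins for the real dataset rows
sft_rows = [{"instruction": f"q{i}", "response": f"a{i}", "diff": "easy"} for i in range(7000)]
train_rows, heldout_rows = split_sft(sft_rows)

dpo_rows = [{"prompt": f"p{i}", "chosen": f"good{i}", "rejected": f"bad{i}"} for i in range(1951)]
dpo_rows += [{"prompt": f"dup{i}", "chosen": "same", "rejected": "same"} for i in range(4)]
dpo_kept = drop_identical_pairs(dpo_rows)

print(len(train_rows), len(heldout_rows), len(dpo_kept))
```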
Training Procedure
Stage 1 — SFT (best of 5 trials):
- LoRA rank 64, alpha 128, target modules: q/k/v/o_proj
- Learning rate: 5e-5, batch size 1 (effective 4), 1 epoch
- Scheduler: cosine with 5% warmup
- Precision: bfloat16, platform: Google Colab T4
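The Stage 1 settings imply a few derived quantities, worked through in the sketch below. It assumes a single GPU (so the effective batch of 4 comes from gradient accumulation), the standard LoRA scaling of alpha / rank, warmup steps rounded up, and a linear warmup followed by cosine decay to zero. None of these details are confirmed by the card.

```python
import math

NUM_EXAMPLES = 5400       # SFT training pairs (from the card)
PER_DEVICE_BATCH = 1      # from the card
EFFECTIVE_BATCH = 4       # from the card
WARMUP_FRACTION = 0.05    # from the card
PEAK_LR = 5e-5            # from the card
LORA_RANK, LORA_ALPHA = 64, 128  # from the card

# Assumption: single GPU, so accumulation makes up the effective batch
grad_accum_steps = EFFECTIVE_BATCH // PER_DEVICE_BATCH
optimizer_steps = math.ceil(NUM_EXAMPLES / EFFECTIVE_BATCH)    # one epoch
warmup_steps = math.ceil(optimizer_steps * WARMUP_FRACTION)    # assumption: rounded up
lora_scaling = LORA_ALPHA / LORA_RANK                          # standard LoRA scaling


def lr_at(step, peak_lr=PEAK_LR, warmup=warmup_steps, total=optimizer_steps):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(grad_accum_steps, optimizer_steps, warmup_steps, lora_scaling)
```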
Stage 2 — DPO (best of 5 trials):
- SFT adapter merged into base before DPO training
- β = 0.05, LoRA rank 32, alpha 64
- Learning rate: 3e-5, batch size 2 (effective 8), 1 epoch
- Reference model: adapter switching (ref_model=None)
- Precision: bfloat16, platform: Google Colab T4
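To make the role of β concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single preference triple, written in plain Python. It is not the authors' training code. The inputs are summed sequence log-probabilities; with adapter switching, the reference values would presumably come from the same network with the DPO adapter disabled, which is an assumption about the setup.

```python
import math

BETA = 0.05  # from the card


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=BETA):
    """Sigmoid DPO loss for one (prompt, chosen, rejected) triple.

    loss = -log(sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r))))
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    x = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy equals reference: the loss is log(2)
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy prefers the chosen answer more than the reference does: lower loss
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
print(neutral, improved)
```

A small β such as 0.05 makes the loss less sensitive to a given log-probability margin, so the policy is penalized less sharply for drifting from the reference.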
This model is the fully merged result of base + SFT adapter + DPO adapter.
Evaluation
Metrics
- BLEU (SacreBLEU, effective_order=True): measures n-gram overlap with gold answers
- BERTScore F1 (RoBERTa-large): measures semantic similarity with gold answers
Results
Evaluated on 10 manually written clinical prompts, using answers generated by ChatGPT (GPT-4o) as the gold references.
| Model | BLEU ↑ | BERTScore F1 ↑ | Val Loss ↓ |
|---|---|---|---|
| Base TinyLlama | 1.117 | 0.779 | — |
| After SFT (Trial 5) | 3.339 | 0.829 | 1.330 |
| After DPO (Trial 5) ⭐ | 3.009 | 0.856 | 0.463 |
DPO raised BERTScore F1 from 0.829 to 0.856 over SFT, an absolute gain of 0.027 (about 3% relative). The slight BLEU decrease (3.339 to 3.009) is expected, since preference-tuned models tend to paraphrase rather than mirror reference wording.
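The SFT-to-DPO comparison can be reproduced from the table with a few lines of arithmetic. The sketch below uses the rounded values as printed, so the relative figures may differ slightly from ones computed on unrounded scores. The `change` helper is a hypothetical name.

```python
# Scores copied from the results table (rounded to three decimals)
results = {
    "base": {"bleu": 1.117, "bertscore_f1": 0.779},
    "sft": {"bleu": 3.339, "bertscore_f1": 0.829},
    "dpo": {"bleu": 3.009, "bertscore_f1": 0.856},
}


def change(metric, before, after):
    """Return (absolute change, relative change in percent) between two rows."""
    old, new = results[before][metric], results[after][metric]
    return new - old, 100.0 * (new - old) / old


for metric in ("bleu", "bertscore_f1"):
    abs_delta, rel_delta = change(metric, "sft", "dpo")
    print(f"{metric}: {abs_delta:+.3f} absolute, {rel_delta:+.1f}% relative")
```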
Bias, Risks, and Limitations
- Factual errors: The model makes drug classification errors (e.g. metformin mechanism) and gives incomplete descriptions of clinical protocols. Do not rely on outputs for clinical decisions.
- Small dataset: Trained on 5,400 SFT examples and 1,951 preference pairs, far fewer than production medical AI systems use.
- English only: Trained exclusively on English clinical text.
- 1.1B parameters: Significantly smaller than state-of-the-art medical LLMs. Knowledge is limited and responses may lack depth on complex clinical topics.
Environmental Impact
- Hardware: NVIDIA T4 GPU (Google Colab free tier)
- Total training time: ~730 minutes across all 10 trials (5 SFT + 5 DPO)
- Cloud provider: Google Colab
Citation
@misc{medillama2026,
  title = {medi-LLaMA: SFT and DPO Fine-Tuning of TinyLlama for Medical QA},
  author = {Moosani, Muhammad Hamza and Muneeb, Hadiya and Sultan, Maryam},
  year = {2026},
  howpublished = {\url{https://github.com/LeonaKV18/Medi-LLaMA}}
}