medi-LLaMA

TinyLlama-1.1B fine-tuned on medical question answering using Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) on the MedCRAFT dataset.

Model Details

Model Description

Developed by: Muhammad Hamza Moosani, Hadiya Muneeb, Maryam Sultan
Model type: Causal Language Model (decoder-only)
Language: English
License: MIT
Base model: TinyLlama/TinyLlama_v1.1 (1.1B parameters)
Fine-tuning method: LoRA SFT → LoRA DPO (fully merged)
Training data: MedCRAFT SFT + DPO splits

Model Sources

Repository: LeonaKV18/Medi-LLaMA
Demo: medi-llama-demo on HuggingFace Spaces

Uses

Direct Use

This model is intended for medical question answering in an educational or research context. It can answer clinical questions about symptoms, drug mechanisms, diagnostic criteria, and treatment protocols.

Out-of-Scope Use

This model should not be used as a substitute for professional medical advice, diagnosis, or treatment. It is a research prototype trained on a small dataset and will produce factual errors. Do not use for clinical decision-making.

How to Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the merged model in full precision on CPU
# (passing device_map requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    "MuhammadHamza33/medi-llama-dpo",
    torch_dtype=torch.float32,
    device_map='cpu'
)
tokenizer = AutoTokenizer.from_pretrained("MuhammadHamza33/medi-llama-dpo")

# System prompt placed inside the <<SYS>> block of the template
SYSTEM = (
    "You are an experienced and knowledgeable medical professional. "
    "Provide clear, factual, and helpful medical information."
)

# Llama-2-style chat template: system prompt first, then the user question
question = "What are the early warning signs of Type 2 diabetes?"
prompt = f"<s>[INST] <<SYS>>\n{SYSTEM}\n<</SYS>>\n\n{question} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding without gradient tracking; EOS doubles as the padding token
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
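
When asking several questions in a row, it can help to factor the prompt template out of the snippet above. The helper below is a minimal sketch; the names `build_prompt` and `DEFAULT_SYSTEM` are illustrative and not part of the released code.

```python
# Hypothetical helper: reproduces the prompt format from the example above.
DEFAULT_SYSTEM = (
    "You are an experienced and knowledgeable medical professional. "
    "Provide clear, factual, and helpful medical information."
)


def build_prompt(question, system=DEFAULT_SYSTEM):
    """Wrap a question in the Llama-2-style template used in the example above."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{question.strip()} [/INST]"


print(build_prompt("What are the early warning signs of Type 2 diabetes?"))
```

The returned string can be passed to `tokenizer(...)` in place of `prompt`.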

Training Details

Training Data

SFT: sherry0213/MedCRAFT SFT split — 5,400 instruction-response pairs (90% of the first 6,000 rows after quality filtering). Each row has an instruction, response, and diff (difficulty) column.

DPO: sherry0213/MedCRAFT DPO split — 1,951 preference triples (prompt, chosen, rejected) after removing 4 identical chosen/rejected pairs.
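
The two preparation steps described above (dropping identical chosen/rejected pairs and keeping 90% of 6,000 rows for training) can be sketched in plain Python. This is an illustration on toy data, not the authors' preprocessing script: the function names, the shuffle, and the seed are assumptions, and the quality-filtering criteria are not reproduced because the card does not state them.

```python
import random


def drop_identical_pairs(rows):
    """Keep only preference rows whose chosen and rejected answers differ."""
    return [r for r in rows if r["chosen"] != r["rejected"]]


def train_val_split(rows, train_fraction=0.9, seed=42):
    """Shuffle a copy of the rows and split it into train and validation parts.

    Assumption: the shuffle and the seed are illustrative choices.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy rows standing in for the real DPO split (columns: prompt, chosen, rejected)
toy_dpo = [
    {"prompt": "p1", "chosen": "a", "rejected": "b"},
    {"prompt": "p2", "chosen": "same", "rejected": "same"},  # removed by the filter
    {"prompt": "p3", "chosen": "c", "rejected": "d"},
]
clean = drop_identical_pairs(toy_dpo)

# 6,000 placeholder rows standing in for the filtered SFT rows
train, val = train_val_split(list(range(6000)))
print(len(clean), len(train), len(val))  # 2 5400 600
```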

Training Procedure

Stage 1 — SFT (best of 5 trials):

LoRA rank 64, alpha 128, target modules: q/k/v/o_proj
Learning rate: 5e-5, batch size 1 (effective 4), 1 epoch
Scheduler: cosine with 5% warmup
Precision: bfloat16, platform: Google Colab T4
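
To make the Stage 1 schedule concrete, the sketch below computes the learning rate under linear warmup followed by cosine decay, using the values listed above. It assumes that all 5,400 SFT pairs are seen once at an effective batch size of 4 (1,350 optimizer steps) and that the schedule decays to zero; the function name `cosine_lr` is illustrative. It also shows the standard LoRA scaling factor alpha / rank.

```python
import math


def cosine_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: 5,400 examples, one epoch, effective batch size 4
total_steps = 5400 // 4   # 1350 optimizer steps
lora_scale = 128 / 64     # alpha / rank = 2.0, the factor applied to the LoRA update

for step in (0, 34, 68, 709, 1350):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

With these assumptions the warmup lasts 68 steps, the rate peaks at 5e-5, and it falls to half the peak at the midpoint of the decay phase.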

Stage 2 — DPO (best of 5 trials):

SFT adapter merged into base before DPO training
β = 0.05, LoRA rank 32, alpha 64
Learning rate: 3e-5, batch size 2 (effective 8), 1 epoch
Reference model: adapter switching (ref_model=None)
Precision: bfloat16, platform: Google Colab T4

This model is the fully merged result of base + SFT adapter + DPO adapter.
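
For readers unfamiliar with the Stage 2 objective, the sketch below evaluates the standard DPO loss for a single preference pair with β = 0.05. It is a plain-Python illustration of the formula, not the training code used for this model; the log-probability values in the example are made up.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.05):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Margin: how much more the policy favours the chosen answer than the reference does
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: zero margin, loss equals ln 2
print(round(dpo_loss(-50.0, -50.0, -50.0, -50.0), 4))

# Illustrative log-probabilities where the policy has moved toward the chosen answer
print(round(dpo_loss(-40.0, -60.0, -50.0, -50.0), 4))
```

A policy that has not moved from the reference scores ln 2 ≈ 0.693 on every pair, so an average loss below that value indicates a positive average margin toward the chosen answers.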

Evaluation

Metrics

BLEU (SacreBLEU, effective_order=True): measures n-gram overlap with gold answers
BERTScore F1 (RoBERTa-large): measures semantic similarity with gold answers

Results

Evaluated on 10 manually written clinical prompts, with gold answers generated by ChatGPT (GPT-4o).

Model	BLEU ↑	BERTScore F1 ↑	Val Loss ↓
Base TinyLlama	1.117	0.779	—
After SFT (Trial 5)	3.339	0.829	1.330
After DPO (Trial 5) ⭐	3.009	0.856	0.463

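DPO raised BERTScore F1 from 0.829 to 0.856, a relative gain of about 3% over SFT. The slight BLEU decrease (3.339 to 3.009) is expected, since preference-tuned models tend to paraphrase rather than mirror reference wording. The two validation losses come from different objectives (SFT and DPO) and are not directly comparable.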
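
The relative changes between stages can be recomputed from the table with a few lines of Python. The sketch uses the rounded values shown in the table, so its results may differ slightly from figures computed on unrounded scores; the helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Rounded scores copied from the results table
scores = {
    "Base TinyLlama": {"bleu": 1.117, "bertscore_f1": 0.779},
    "After SFT": {"bleu": 3.339, "bertscore_f1": 0.829},
    "After DPO": {"bleu": 3.009, "bertscore_f1": 0.856},
}

sft, dpo = scores["After SFT"], scores["After DPO"]
bert_change = relative_change(sft["bertscore_f1"], dpo["bertscore_f1"])
bleu_change = relative_change(sft["bleu"], dpo["bleu"])
print(f"BERTScore F1, SFT to DPO: {bert_change:+.2f}%")
print(f"BLEU, SFT to DPO: {bleu_change:+.2f}%")
```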

Bias, Risks, and Limitations

Factual errors: The model makes drug-classification errors (e.g. on the mechanism of metformin) and gives incomplete descriptions of clinical protocols. Do not rely on outputs for clinical decisions.
Small dataset: Trained on 5,400 SFT examples and 1,951 preference pairs, far fewer than production medical AI systems use.
English only: Trained exclusively on English clinical text.
1.1B parameters: Significantly smaller than state-of-the-art medical LLMs. Knowledge is limited and responses may lack depth on complex clinical topics.

Environmental Impact

Hardware: NVIDIA T4 GPU (Google Colab free tier)
Total training time: ~730 minutes across all 10 trials (5 SFT + 5 DPO)
Cloud provider: Google Colab

Citation

@misc{medillama2026,
  title        = {medi-LLaMA: SFT and DPO Fine-Tuning of TinyLlama for Medical QA},
  author       = {Moosani, Muhammad Hamza and Muneeb, Hadiya and Sultan, Maryam},
  year         = {2026},
  howpublished = {\url{https://github.com/LeonaKV18/Medi-LLaMA}}
}
