MedGemma 4B on DGX Spark: Blackwell Inference Benchmark
Key Finding
NF4 quantization on NVIDIA Blackwell (GB10) delivers 39.8 tok/s at 3.5 GB VRAM, nearly 2x faster than unquantized bf16 (20.5 tok/s), with no quality degradation observed on the test prompts.
This makes real-time medical AI inference viable on consumer-grade Blackwell hardware.
Benchmark Results
| Configuration | Throughput | VRAM | Model Size | Hardware |
|---|---|---|---|---|
| NF4 (bitsandbytes) | 39.8 tok/s | 3.5 GB | ~2.5 GB | DGX Spark GB10 |
| bf16 (full precision) | 20.5 tok/s | 8.6 GB | ~8 GB | DGX Spark GB10 |
| Q4_K_M GGUF (CPU) | 12.3 tok/s | ~4 GB RAM | 2.4 GB | Azure D4as_v5 (4 vCPUs, EPYC 7763) |
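As a rough sanity check on the Model Size column, the weight footprint can be estimated as parameter count times bits per parameter. The sketch below assumes a nominal 4e9 parameters (taken from the "4B" in the model name, not an exact count), and `weight_gb` is a hypothetical helper introduced only for illustration.

```python
# Rough weight-only footprint: parameters x bits per parameter.
# ASSUMPTION: 4e9 is the nominal "4B" parameter count, not an exact figure.
NOMINAL_PARAMS = 4e9


def weight_gb(params: float, bits_per_param: float) -> float:
    """Return the approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(NOMINAL_PARAMS, 16)     # 2 bytes per parameter
four_bit_gb = weight_gb(NOMINAL_PARAMS, 4)  # 0.5 bytes per parameter

print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{four_bit_gb:.1f} GB (before quantization metadata)")
```

The estimate lands at the table's ~8 GB for bf16. The 4-bit figures in the table sit somewhat above this lower bound, which is plausibly due to quantization constants and any layers kept in higher precision.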
Speedup Summary
- NF4 vs bf16: 1.94x faster, 59% less VRAM
- NF4 vs CPU GGUF: 3.2x faster (GPU vs 4-vCPU CPU)
- All configurations produced responses of comparable quality on the test prompts (see Quality Verification)
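The ratios in this summary follow directly from the Benchmark Results table. A short sketch of the arithmetic, using only the published figures:

```python
# Figures copied from the Benchmark Results table.
nf4_tps, bf16_tps, cpu_tps = 39.8, 20.5, 12.3
nf4_vram_gb, bf16_vram_gb = 3.5, 8.6

speedup_vs_bf16 = nf4_tps / bf16_tps        # throughput ratio on the same GPU
speedup_vs_cpu = nf4_tps / cpu_tps          # GPU NF4 vs CPU GGUF
vram_saving = 1 - nf4_vram_gb / bf16_vram_gb

print(f"NF4 vs bf16: {speedup_vs_bf16:.2f}x")
print(f"NF4 vs CPU GGUF: {speedup_vs_cpu:.1f}x")
print(f"VRAM saving: {vram_saving:.0%}")
```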
Hardware
DGX Spark (GB10 Blackwell)
- GPU: NVIDIA GB10, compute capability 12.1
- Memory: 128 GB unified (shared CPU/GPU)
- Native bf16 tensor cores + FP4 hardware decode
Azure D4as_v5 (CPU baseline)
- CPU: AMD EPYC 7763, 4 vCPUs (2 cores × 2 threads)
- RAM: 16 GB
Methodology
Model
- Base: google/medgemma-4b-it
- Quantized: unsloth/medgemma-4b-it-GGUF (Q4_K_M)
Benchmark Protocol
- Warmup: 1 short generation discarded
- Prompts: 3-5 medical questions (malaria symptoms, diarrhea treatment, diabetes types, preeclampsia, ORT protocol)
- Generation: `max_new_tokens=200`, `temperature=0.3`, `do_sample=True`
- Timing: `torch.cuda.synchronize()` before/after, wall-clock for CPU
- Metrics: tokens generated / wall-clock time = tok/s
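The protocol above can be sketched as a small timing harness. This is an illustration, not the script used for the published numbers: `measure_tok_per_s`, `generate`, `sync`, and `warmup_prompt` are hypothetical names, and aggregating tokens and seconds across all prompts (rather than averaging per-prompt rates) is an assumption.

```python
import time
from typing import Callable, Optional, Sequence


def measure_tok_per_s(
    generate: Callable[[str], int],
    prompts: Sequence[str],
    sync: Optional[Callable[[], None]] = None,
    warmup_prompt: Optional[str] = None,
) -> float:
    """Return total generated tokens divided by total wall-clock seconds.

    `generate` is a hypothetical callable that runs one generation and
    returns the number of new tokens. `sync` is called before and after
    each timed run; on a GPU it would be `torch.cuda.synchronize`.
    """
    if warmup_prompt is not None:
        generate(warmup_prompt)  # warmup run, result discarded

    total_tokens = 0
    total_seconds = 0.0
    for prompt in prompts:
        if sync is not None:
            sync()  # make sure no earlier GPU work leaks into the timing
        start = time.perf_counter()
        n_tokens = generate(prompt)
        if sync is not None:
            sync()  # wait for the generation to actually finish
        total_seconds += time.perf_counter() - start
        total_tokens += n_tokens
    return total_tokens / total_seconds
```

For the GPU configurations, `generate` would wrap `model.generate` and return the number of new tokens, with `sync=torch.cuda.synchronize`. For the CPU baseline, `sync` can be left as `None`.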
Quality Verification
All configurations produce medically accurate, well-structured responses. Example (malaria symptoms):
"The symptoms of malaria can vary depending on the type of malaria parasite, the severity of the infection, and the individual's immune response. Common symptoms include fever, chills, headache..."
No hallucinations, degradation, or truncation observed across quantization levels.
How to Reproduce
NF4 (fastest, recommended)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute. bitsandbytes defaults to "fp4"
# unless bnb_4bit_quant_type is set explicitly.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("google/medgemma-4b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",
    quantization_config=quantization_config,
    device_map="auto",
)

messages = [{"role": "user", "content": "What are the symptoms of malaria?"}]
# return_dict=True yields input_ids plus attention_mask for generate().
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
bf16 (full precision)
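# Reuses the imports, tokenizer, and generation code from the NF4 example;
# only the model loading changes.
model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)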
GGUF (CPU, via llama-server)
# Install llama.cpp
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j$(nproc) --target llama-server
# Download model
pip install huggingface_hub
python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/medgemma-4b-it-GGUF', 'medgemma-4b-it-Q4_K_M.gguf', local_dir='models')"
# Serve
./build/bin/llama-server -m models/medgemma-4b-it-Q4_K_M.gguf -c 2048 -t 4 --port 8080
# Query
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is malaria?"}],"max_tokens":200}'
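The same query can be issued from Python to get a rough throughput figure for the CPU configuration. The sketch below uses only the standard library and assumes the server started above is listening on port 8080 and returns an OpenAI-style `usage.completion_tokens` field. `build_request`, `tok_per_s`, and `query` are hypothetical helper names.

```python
import json
import time
import urllib.request

# ASSUMPTION: llama-server is running locally on port 8080, as in the serve command.
URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build a POST request carrying a single-turn chat completion payload."""
    body = {"messages": [{"role": "user", "content": prompt}], "max_tokens": max_tokens}
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tok_per_s(response: dict, elapsed_s: float) -> float:
    """Generated tokens divided by elapsed wall-clock seconds."""
    # ASSUMPTION: the response carries an OpenAI-style usage.completion_tokens field.
    return response["usage"]["completion_tokens"] / elapsed_s


def query(prompt: str) -> float:
    """Send one prompt, print the answer, and return the measured tok/s."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt)) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    print(payload["choices"][0]["message"]["content"])
    return tok_per_s(payload, elapsed)
```

Calling `query("What is malaria?")` with the server running prints the answer and returns a tok/s value. Because the timer wraps the whole HTTP round trip, the figure includes prompt processing and request overhead, so it may read slightly lower than a generation-only measurement.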
Citation
@misc{craneailabs2026medgemma,
  title={MedGemma 4B Blackwell Inference Benchmark},
  author={Crane AI Labs},
  year={2026},
  url={https://huggingface.co/CraneAILabs/medgemma-blackwell-benchmark}
}
About
Crane AI Labs builds AI for African languages and healthcare.