MedGemma 4B on DGX Spark — Blackwell Inference Benchmark

Key Finding

NF4 quantization on NVIDIA Blackwell (GB10) delivers 39.8 tok/s at 3.5 GB VRAM, nearly 2x the throughput of full-precision bf16 (20.5 tok/s), with no quality degradation observed on the test prompts.

This makes real-time medical AI inference viable on consumer-grade Blackwell hardware.

Benchmark Results

Configuration	Throughput	VRAM	Model Size	Hardware
NF4 (bitsandbytes)	39.8 tok/s	3.5 GB	~2.5 GB	DGX Spark GB10
bf16 (full precision)	20.5 tok/s	8.6 GB	~8 GB	DGX Spark GB10
Q4_K_M GGUF (CPU)	12.3 tok/s	~4 GB RAM	2.4 GB	Azure D4as_v5 (4-vCPU EPYC)
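
The Model Size column can be sanity-checked with rough arithmetic. The sketch below assumes a nominal 4 billion parameters (the exact count is not stated here) and 4-bit storage with one 32-bit scale per block of 64 weights, which is how bitsandbytes is commonly described without double quantization. Both figures are assumptions. Layers left unquantized, such as embeddings, would push the NF4 number up toward the reported ~2.5 GB.

```python
N_PARAMS = 4e9  # assumption: nominal parameter count, not a measured value


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9


# bf16 stores every weight in 16 bits.
bf16_gb = weight_gb(N_PARAMS, 16)

# Assumed NF4 layout: 4-bit code per weight plus one fp32 scale
# amortized over a 64-weight block (32 / 64 = 0.5 extra bits per weight).
nf4_bits = 4 + 32 / 64
nf4_gb = weight_gb(N_PARAMS, nf4_bits)

print(f"bf16 weights: {bf16_gb:.2f} GB")
print(f"NF4 weights (quantized layers only): {nf4_gb:.2f} GB")
```

Measured VRAM is higher than either estimate because it also includes the KV cache, activations, and the CUDA context.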

Speedup Summary

NF4 vs bf16: 1.94x faster, 59% less VRAM
NF4 vs CPU GGUF: 3.2x faster (GPU vs 4-vCPU CPU)
All configurations produced responses of comparable quality on the test prompts (sampling is enabled, so outputs are not token-identical)
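
The ratios above follow directly from the table. A quick check, using only the reported figures:

```python
# Reported figures from the benchmark table (tok/s and GB).
nf4_tps, bf16_tps, gguf_cpu_tps = 39.8, 20.5, 12.3
nf4_vram_gb, bf16_vram_gb = 3.5, 8.6

speedup_vs_bf16 = nf4_tps / bf16_tps
speedup_vs_cpu = nf4_tps / gguf_cpu_tps
vram_saving = 1 - nf4_vram_gb / bf16_vram_gb

print(f"NF4 vs bf16: {speedup_vs_bf16:.2f}x faster, {vram_saving:.0%} less VRAM")
print(f"NF4 vs CPU GGUF: {speedup_vs_cpu:.1f}x faster")
```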

Hardware

DGX Spark (GB10 Blackwell)

GPU: NVIDIA GB10, compute capability 12.1
Memory: 128 GB unified (shared CPU/GPU)
Native bf16 tensor cores + FP4 hardware decode

Azure D4as_v5 (CPU baseline)

CPU: AMD EPYC 7763, 4 vCPUs (2 cores × 2 threads)
RAM: 16 GB

Methodology

Model

Base: google/medgemma-4b-it
GGUF (CPU run): unsloth/medgemma-4b-it-GGUF (Q4_K_M); the NF4 run quantizes the base checkpoint on load via bitsandbytes

Benchmark Protocol

Warmup: 1 short generation discarded
Prompts: 3-5 medical questions (malaria symptoms, diarrhea treatment, diabetes types, preeclampsia, ORT protocol)
Generation: max_new_tokens=200, temperature=0.3, do_sample=True
Timing: wall-clock; on GPU, torch.cuda.synchronize() is called before starting and after stopping the timer
Metrics: tokens generated / wall-clock time = tok/s
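
The protocol above can be sketched as a small harness. This is a minimal illustration, not the script used for the reported numbers: the function name `measure_throughput`, its parameters, and the choice to aggregate tokens and time across prompts (rather than average per-prompt rates) are all assumptions.

```python
import time
from typing import Callable, Iterable, Optional


def measure_throughput(
    generate: Callable[[str], int],
    prompts: Iterable[str],
    warmup_prompt: Optional[str] = None,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return aggregate tokens per second over all prompts.

    generate(prompt) runs one generation and returns the number of newly
    generated tokens. sync, if given, is called before each timer read so
    that queued GPU work is finished (hypothetical hook; on CUDA one would
    pass torch.cuda.synchronize).
    """
    if warmup_prompt is not None:
        generate(warmup_prompt)  # warmup generation, result discarded

    total_tokens = 0
    total_seconds = 0.0
    for prompt in prompts:
        if sync is not None:
            sync()
        start = time.perf_counter()
        n_tokens = generate(prompt)
        if sync is not None:
            sync()
        total_seconds += time.perf_counter() - start
        total_tokens += n_tokens

    return total_tokens / total_seconds
```

For the Transformers runs, `generate` would wrap `model.generate` and return the output length minus the prompt length. For the CPU run, `sync` is simply left as `None`.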

Quality Verification

All configurations produce medically accurate, well-structured responses. Example (malaria symptoms):

"The symptoms of malaria can vary depending on the type of malaria parasite, the severity of the infection, and the individual's immune response. Common symptoms include fever, chills, headache..."

No hallucinations, degradation, or truncation observed across quantization levels.

How to Reproduce

NF4 (fastest — recommended)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights with bf16 compute. The quant type is set explicitly
# because BitsAndBytesConfig defaults to "fp4", not "nf4".
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("google/medgemma-4b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",
    quantization_config=quantization_config,
    device_map="auto",
)

messages = [{"role": "user", "content": "What are the symptoms of malaria?"}]
# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))

bf16 (full precision)

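import torch
from transformers import AutoModelForCausalLM

# Same tokenizer and generation code as the NF4 example; only the model load changes.
model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)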

GGUF (CPU, via llama-server)

# Build llama.cpp (llama-server target only)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j$(nproc) --target llama-server

# Download model
pip install huggingface_hub
python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/medgemma-4b-it-GGUF', 'medgemma-4b-it-Q4_K_M.gguf', local_dir='models')"

# Serve with a 2048-token context on 4 threads
./build/bin/llama-server -m models/medgemma-4b-it-Q4_K_M.gguf -c 2048 -t 4 --port 8080

# Query the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is malaria?"}],"max_tokens":200}'
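
To turn a server response into a tok/s figure, a client can time the request and divide the reported completion token count by the elapsed time. The sketch below is one way to do that with the standard library. It assumes the response carries an OpenAI-style `usage.completion_tokens` field; the helper names are hypothetical, and the measured time includes prompt processing and HTTP overhead, so it will read slightly lower than pure decode speed.

```python
import json
import time
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # matches the serve command above


def build_request(prompt: str, max_tokens: int = 200, url: str = URL) -> urllib.request.Request:
    """Build a JSON POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response: dict, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time.

    Assumes an OpenAI-style `usage.completion_tokens` field in the response.
    """
    return response["usage"]["completion_tokens"] / elapsed_seconds


def query(prompt: str) -> tuple:
    """Send one prompt to a running server; return (answer text, tok/s)."""
    request = build_request(prompt)
    start = time.perf_counter()
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return body["choices"][0]["message"]["content"], tokens_per_second(body, elapsed)
```

With the server running, `query("What is malaria?")` returns the answer text together with the measured throughput.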

Citation

@misc{craneailabs2026medgemma,
  title={MedGemma 4B Blackwell Inference Benchmark},
  author={Crane AI Labs},
  year={2026},
  url={https://huggingface.co/CraneAILabs/medgemma-blackwell-benchmark}
}

About

Crane AI Labs builds AI for African languages and healthcare.