MedPsy-1.7B-GGUF

MedPsy-1.7B-GGUF provides GGUF weights of MedPsy-1.7B for fast, fully on-device inference via llama.cpp and the QVAC SDK. An unquantized BF16 GGUF file (about 4.07 GB) is included alongside seven quantization formats, ranging from near-lossless 8-bit (about 2.16 GB) through a high-quality 5-bit option (about 1.47 GB) down to ultra-compact 3-bit (about 0.89 GB), making the same model deployable across everything from laptops to entry-level smartphones.

Developed by: Tether AI Research
Model type: Text-only causal language model (decoder-only transformer), GGUF quantized
Base (BF16) model: MedPsy-1.7B
Backbone: Qwen3-1.7B (Thinking)
Language: English
License: Apache 2.0
Quantization tool: llama.cpp
Technical report: MedPsy Technical Report
Collection: MedPsy on Hugging Face
All MedPsy variants: MedPsy-4B · MedPsy-1.7B · MedPsy-4B-GGUF · MedPsy-1.7B-GGUF

Available Files

All published files are produced with llama.cpp. The BF16 GGUF file is unquantized: no quantization is applied. We have not separately re-evaluated the BF16 GGUF with llama.cpp; because it preserves the same BF16 tensor precision as the source checkpoint, performance is expected to match the BF16 source model evaluated with vLLM, aside from small backend or runtime differences. Q8_0 does not use imatrix calibration (we verified that imatrix provided no measurable benefit at 8-bit). All sub-8-bit variants use importance-matrix (imatrix) calibration, which consistently reduces quality degradation. See the MedPsy Technical Report (Section 4.7) for the full quantization methodology, including the K-quants vs I-quants comparison and the per-bit-count imatrix ablation.

| File | Format | Imatrix | Size | Δ Size | Δ AVG (pts) | Δ AVG (rel %) | Recommended For |
|---|---|---|---|---|---|---|---|
| medpsy-1.7b-bf16.gguf | BF16 | n/a | 4.07 GB | 0% | ≈0.00 | ≈0.00% | Unquantized GGUF (same performance expected) |
| medpsy-1.7b-q8_0.gguf | Q8_0 | no (not needed) | 2.16 GB | -47% | 0.00 | 0.00% | Best quality, near-lossless |
| medpsy-1.7b-q5_k_m-imat.gguf | Q5_K_M | yes | 1.47 GB | -64% | -0.02 | -0.03% | Recommended high-quality 5-bit option |
| medpsy-1.7b-q4_k_m-imat.gguf | Q4_K_M | yes | 1.28 GB | -69% | -0.73 | -1.10% | Recommended for smartphone (best size/quality trade-off) |
| medpsy-1.7b-iq4_nl-imat.gguf | IQ4_NL | yes | 1.23 GB | -70% | -1.70 | -2.56% | Alternative 4-bit (slightly worse than Q4_K_M) |
| medpsy-1.7b-iq4_xs-imat.gguf | IQ4_XS | yes | 1.18 GB | -71% | -1.79 | -2.69% | Smaller 4-bit alternative |
| medpsy-1.7b-iq3_m-imat.gguf | IQ3_M | yes | 1.03 GB | -75% | -3.58 | -5.40% | ⚠ Notable degradation, expect more hallucinations - not recommended for medical use |
| medpsy-1.7b-iq3_xxs-imat.gguf | IQ3_XXS | yes | 0.89 GB | -78% | -12.46 | -18.78% | ⚠ Severe degradation - not recommended |

Two ways to read quality loss. Δ AVG (pts) is the absolute change in AVG Score vs the BF16/vLLM source-model baseline, i.e. variant − baseline, so points lost appear as a negative number. Δ AVG (rel %) is the same change expressed relative to the baseline, (variant − baseline) / baseline. They convey complementary information: the absolute delta is the easiest "how much score did I actually lose?" reading, while the relative delta normalizes by the baseline so quality degradation is comparable across models with different starting scores. AVG Score = mean of HealthBench Overall and Closed-Ended Average.
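As a worked example, the sketch below recomputes the Q4_K_M row from the rounded HealthBench and Closed-Ended figures published in this card. The function names are illustrative, not part of any MedPsy tooling, and the sign convention follows the tables (a loss is negative).

```javascript
// AVG Score as defined in this card: mean of HealthBench Overall and Closed-Ended Average.
function avgScore(healthBench, closedEndedAvg) {
  return (healthBench + closedEndedAvg) / 2;
}

// Absolute change in points; negative means the variant scored below the baseline.
function deltaPts(variant, baseline) {
  return variant - baseline;
}

// Relative change as a percentage of the baseline score.
function deltaRelPct(variant, baseline) {
  return ((variant - baseline) / baseline) * 100;
}

const baseline = avgScore(70, 62.62); // BF16 source model (vLLM)
const q4km = avgScore(69, 62.16);     // Q4_K_M (imatrix)

console.log(baseline.toFixed(2));                    // 66.31
console.log(q4km.toFixed(2));                        // 65.58
console.log(deltaPts(q4km, baseline).toFixed(2));    // -0.73
console.log(deltaRelPct(q4km, baseline).toFixed(2)); // -1.10
```

Because the inputs here are the rounded table values, other rows may differ from the published deltas in the last digit.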

Quick Recommendation

| Your constraint | Choose |
|---|---|
| You want a llama.cpp-native unquantized file | BF16 - no quantization applied; quality expected to match the source BF16 checkpoint (4.07 GB) |
| You want the best possible quality at smaller size | Q8_0 - identical AVG Score to the BF16 baseline at about half the size (2.16 GB, 47% smaller) |
| You want extra quality headroom over 4-bit | Q5_K_M (imatrix) - 64% smaller than BF16, only -0.02 pts (-0.03% rel) on AVG Score |
| You want the best size/quality trade-off (most users) | Q4_K_M (imatrix) - 69% smaller, only -0.73 pts (-1.10% rel) on AVG Score |
| You need the smallest 4-bit option | IQ4_XS (imatrix) - 71% smaller (1.18 GB), -1.79 pts (-2.69% rel) on AVG Score |
| You need around 1 GB | IQ3_M (1.03 GB), but expect notable quality degradation and increased hallucination risk - not recommended for medical use |
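The recommendation logic above can be sketched as a small lookup. This is only an illustration: `VARIANTS` copies the file sizes from the Available Files table, while `pickVariant` and the `medicalOk` flag are hypothetical names introduced here. Note that file size is only a lower bound on runtime memory, since the context cache and runtime buffers add overhead.

```javascript
// File sizes (GB) copied from the Available Files table.
// medicalOk: false marks the 3-bit variants the card advises against for medical use.
const VARIANTS = [
  { format: "BF16",    sizeGB: 4.07, medicalOk: true },
  { format: "Q8_0",    sizeGB: 2.16, medicalOk: true },
  { format: "Q5_K_M",  sizeGB: 1.47, medicalOk: true },
  { format: "Q4_K_M",  sizeGB: 1.28, medicalOk: true },
  { format: "IQ4_NL",  sizeGB: 1.23, medicalOk: true },
  { format: "IQ4_XS",  sizeGB: 1.18, medicalOk: true },
  { format: "IQ3_M",   sizeGB: 1.03, medicalOk: false },
  { format: "IQ3_XXS", sizeGB: 0.89, medicalOk: false },
];

// Hypothetical helper: return the largest variant whose file fits the budget,
// or null if nothing suitable fits. In the table, larger files lose less quality.
function pickVariant(budgetGB, { medicalUse = true } = {}) {
  const candidates = VARIANTS
    .filter((v) => v.sizeGB <= budgetGB)
    .filter((v) => !medicalUse || v.medicalOk)
    .sort((a, b) => b.sizeGB - a.sizeGB);
  return candidates.length > 0 ? candidates[0] : null;
}

console.log(pickVariant(1.3)?.format); // Q4_K_M
console.log(pickVariant(1.2)?.format); // IQ4_XS
console.log(pickVariant(1.0));         // null (only 3-bit files fit, and they are excluded)
```

Returning `null` rather than silently falling back to a 3-bit file keeps the "not recommended for medical use" guidance explicit in calling code.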

Benchmark Results

The comparison uses the BF16 source model evaluated with vLLM as the reference baseline. Quantized GGUF variants were evaluated on the full closed-ended benchmark suite (7 medical benchmarks, averaged) and HealthBench (CompassJudger-2-32B-Instruct as judge), using the same benchmark protocol. The BF16 GGUF file is the unquantized GGUF export and has not been separately re-run with llama.cpp; since no quantization is applied, its performance is expected to match the BF16 source model aside from small backend or runtime differences. AVG Score is the average of HealthBench Overall and Closed-Ended Average.

| Variant | Size (GB) | HealthBench | HB Hard | CE Avg | AVG Score | Δ AVG (pts) | Δ AVG (rel %) | Δ Size |
|---|---|---|---|---|---|---|---|---|
| MedPsy-1.7B (BF16, vLLM baseline) | 4.07 | 70 | 54 | 62.62 | 66.31 | 0.00 | 0.00% | 0% |
| Q8_0 ★ | 2.16 | 70 | 55 | 62.62 | 66.31 | 0.00 | 0.00% | -47% |
| Q5_K_M ★ | 1.47 | 70 | 55 | 62.58 | 66.29 | -0.02 | -0.03% | -64% |
| Q4_K_M ★ | 1.28 | 69 | 52 | 62.16 | 65.58 | -0.73 | -1.10% | -69% |
| IQ4_NL | 1.23 | 69 | 51 | 60.22 | 64.61 | -1.70 | -2.56% | -70% |
| IQ4_XS | 1.18 | 69 | 53 | 60.05 | 64.53 | -1.79 | -2.69% | -71% |
| IQ3_M ⚠ | 1.03 | 67 | 49 | 58.46 | 62.73 | -3.58 | -5.40% | -75% |
| IQ3_XXS ⚠ | 0.89 | 59 | 40 | 48.71 | 53.86 | -12.46 | -18.78% | -78% |

HB = HealthBench; CE = Closed-Ended Average (7 medical benchmarks). Δ AVG (pts) is the absolute point change in AVG Score vs the BF16/vLLM source-model baseline (e.g. 65.58 - 66.31 = -0.73 for Q4_K_M). Δ AVG (rel %) is that change relative to the baseline (e.g. -0.73 / 66.31 = -1.10%). Δ Size is the relative file-size change vs BF16. ★ Recommended variants (best of class). ⚠ Not recommended for medical use due to notable (IQ3_M) or severe (IQ3_XXS) quality degradation that increases hallucination risk. AVG Score = (HealthBench Overall + Closed-Ended Average) / 2. Results are averaged over 3 runs with generation parameters temperature=0.6, top_k=20, top_p=0.95, max_output_tokens=16384. Quantized GGUF variants were evaluated with llama.cpp; the BF16 baseline was evaluated with vLLM. HealthBench was evaluated using CompassJudger-2-32B-Instruct as judge.
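The Δ Size column can be reproduced the same way. The sketch below assumes the percentage is computed from the rounded GB sizes shown in the table and rounded to a whole percent; `sizeDeltaPct` is an illustrative name, and values derived from exact byte counts could differ by a point.

```javascript
// Δ Size: relative file-size change vs the BF16 file, rounded to a whole percent.
function sizeDeltaPct(sizeGB, bf16SizeGB) {
  return Math.round((sizeGB / bf16SizeGB - 1) * 100);
}

const BF16_GB = 4.07;
console.log(sizeDeltaPct(2.16, BF16_GB)); // -47 (Q8_0)
console.log(sizeDeltaPct(1.47, BF16_GB)); // -64 (Q5_K_M)
console.log(sizeDeltaPct(1.28, BF16_GB)); // -69 (Q4_K_M)
```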

Key Findings

  • Q8_0 is effectively lossless: identical AVG Score (66.31) to the BF16 baseline at 47% smaller (2.16 GB vs 4.07 GB), with no need for imatrix calibration.
  • Q5_K_M is a recommended high-quality option: -0.02 pts (-0.03% relative) AVG Score at 64% smaller (1.47 GB), while matching Q8_0 on HealthBench and HealthBench Hard.
  • Q4_K_M is the sweet spot: only -0.73 pts AVG Score loss (-1.10% relative) for a 69% size reduction, comfortably fitting in a typical mobile app's memory budget.
  • Smallest safe option is IQ4_XS (1.18 GB, -1.79 pts / -2.69% relative): the smallest variant we recommend for medical use on this 1.7B model.
  • 3-bit is too aggressive at 1.7B: IQ3_M loses 3.58 pts (-5.40% relative) and IQ3_XXS loses 12.46 pts (-18.78% relative). At this model size, 3-bit quantization noticeably degrades reasoning quality and increases the risk of hallucinations. We do not recommend either variant for medical applications. If you need an aggressively quantized model, prefer the 4B GGUF at IQ3_M (2.13 GB), which is far more robust at low bit counts.
  • Calibration matters at low bit counts: imatrix calibration improves the Q4_K_M Closed-Ended Average by +1.58 points, and the effect grows at 3-bit quantizations.

How Quantized Variants Compare to Other Models

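Even Q4_K_M (1.28 GB), the smallest starred (★) variant, retains a substantial accuracy lead over the unquantized open-weight baselines in this size class. Against the Qwen3-1.7B (Thinking) backbone, it leads by 12.21 points on Closed-Ended Avg (62.16 vs 49.95) and 16 points on HealthBench (69 vs 53):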

| Model | Size (GB) | Closed-Ended Avg | HealthBench |
|---|---|---|---|
| MedPsy-1.7B (BF16) | 4.07 | 62.62 | 70 |
| MedPsy-1.7B Q8_0 ★ | 2.16 | 62.62 | 70 |
| MedPsy-1.7B Q5_K_M ★ | 1.47 | 62.58 | 70 |
| MedPsy-1.7B Q4_K_M ★ | 1.28 | 62.16 | 69 |
| Qwen3-1.7B (Thinking) (BF16, backbone) | 4.07 | 49.95 | 53 |
| LFM2.5-1.2B-Thinking (BF16) | 2.6 | 44.15 | 49 |

Usage

llama.cpp

# Download the recommended file (Q4_K_M with imatrix - best size/quality for smartphone)
huggingface-cli download qvac/MedPsy-1.7B-GGUF medpsy-1.7b-q4_k_m-imat.gguf --local-dir .

# Run interactively
./llama-cli -m medpsy-1.7b-q4_k_m-imat.gguf \
    -p "What are the common symptoms and first-line treatments for community-acquired pneumonia?" \
    --temp 0.6 --top-k 20 --top-p 0.95 -n 2048
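
For programmatic access, llama.cpp also ships `llama-server`, which exposes an OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below is an illustration under stated assumptions, not something taken from the MedPsy card: it assumes a server started separately (for example `llama-server -m medpsy-1.7b-q4_k_m-imat.gguf --port 8080`), Node 18+ for the built-in `fetch`, and that the server accepts `top_k` as a llama.cpp-specific extension in the request body. `buildChatRequest` and `ask` are hypothetical helper names.

```javascript
// Sampling settings reported in this card's benchmark protocol.
const SAMPLING = { temperature: 0.6, top_p: 0.95, top_k: 20 };

// Hypothetical helper: build an OpenAI-style chat completion request body.
function buildChatRequest(prompt, maxTokens = 2048) {
  return {
    messages: [{ role: "user", content: prompt }],
    max_tokens: maxTokens, // leave room for the reasoning block before the answer
    ...SAMPLING,
  };
}

// Hypothetical helper: send the request to a locally running llama-server.
// The base URL is an assumption; adjust it to wherever your server listens.
async function ask(prompt, baseUrl = "http://localhost:8080") {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(prompt)),
  });
  if (!res.ok) throw new Error(`llama-server returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

// Usage (requires a running server):
// ask("What are common symptoms of community-acquired pneumonia?").then(console.log);
```

Depending on the server version and flags, the model's reasoning may be returned inline in `content` or in a separate field, so inspect the raw response before relying on its shape.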

QVAC SDK

These GGUF files are designed for deployment through the QVAC SDK, enabling fully private on-device inference on smartphones, tablets, and edge devices. See the QVAC documentation for integration guides.


# 1. Project setup
mkdir medpsy && cd medpsy
npm init -y && npm pkg set type=module

# 2. Install SDK + matching Bare runtime binary
#    (swap linux-x64 for: linux-arm64 | darwin-arm64 | darwin-x64 | win32-x64 | win32-arm64)
npm i @qvac/sdk bare-runtime-linux-x64

# 3. Authenticate with Hugging Face (one-time) and download the recommended quant
hf auth login
hf download qvac/MedPsy-1.7B-GGUF medpsy-1.7b-q4_k_m-imat.gguf --local-dir ./models

# 4. Run inference (streamed, GPU)
node --input-type=module -e '
import { loadModel, completion, unloadModel, VERBOSITY } from "@qvac/sdk";
import { resolve } from "node:path";
const id = await loadModel({
  modelSrc: resolve("./models/medpsy-1.7b-q4_k_m-imat.gguf"),
  modelType: "llamacpp-completion",
  modelConfig: { device: "gpu", ctx_size: 4096, verbosity: VERBOSITY.ERROR },
});
const r = completion({
  modelId: id,
  history: [{ role: "user", content: "First-line treatment for community-acquired pneumonia in 2 sentences." }],
  stream: true,
  generationParams: { temp: 0.6, top_p: 0.95, top_k: 20, predict: 2048 },
});
for await (const t of r.tokenStream) process.stdout.write(t);
const s = await r.stats;
console.log(`\n[${s.tokensPerSecond.toFixed(2)} tok/s, TTFT ${s.timeToFirstToken.toFixed(0)}ms, ${s.backendDevice}]`);
await unloadModel({ modelId: id });
'

Notes:

  • If hf isn't installed: pip install --user "huggingface_hub[cli,hf_xet]" (or via your conda env).
  • MedPsy is a Qwen3-Thinking model: it emits a <think>...</think> block before the answer, so keep predict ≥ 1500.
  • For CPU-only boxes, set modelConfig.device: "cpu".
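
Qwen3-style thinking models typically wrap their reasoning in `<think>...</think>` tags ahead of the final answer. The sketch below shows one way an application might separate the two and detect a reasoning block that was cut off by a low token limit. `splitThinking` is a hypothetical helper, and it assumes the runtime passes the literal tags through in the generated text, which may not hold for every runtime or configuration.

```javascript
// Split a raw completion into its reasoning block and the final answer.
// Assumes the runtime passes the literal <think>...</think> tags through.
function splitThinking(text) {
  const open = "<think>";
  const close = "</think>";
  const start = text.indexOf(open);
  if (start === -1) {
    // No reasoning block present: treat everything as the answer.
    return { thinking: "", answer: text.trim(), truncated: false };
  }
  const end = text.indexOf(close, start + open.length);
  if (end === -1) {
    // Block opened but never closed: generation likely hit the token limit.
    return { thinking: text.slice(start + open.length).trim(), answer: "", truncated: true };
  }
  return {
    thinking: text.slice(start + open.length, end).trim(),
    answer: text.slice(end + close.length).trim(),
    truncated: false,
  };
}

const out = splitThinking("<think>reasoning goes here</think>\nFinal answer.");
console.log(out.thinking);  // reasoning goes here
console.log(out.answer);    // Final answer.
console.log(out.truncated); // false
```

A `truncated: true` result is a signal to retry with a larger `predict` value rather than to show the partial reasoning as an answer.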

Use and Limitations

Intended Use

MedPsy-1.7B-GGUF is intended as a starting point for developers and researchers building downstream healthcare applications involving medical text on-device. Developers are expected to validate, adapt, and make meaningful modifications to the model for their specific use cases.

Appropriate use cases include:

  • On-device medical information retrieval for privacy-sensitive environments, including smartphones
  • Building developer tools and prototypes for health-related applications running on edge devices
  • Research on medical language understanding and reasoning under aggressive quantization

All of these uses should carry appropriate disclaimers.

Limitations

This model is NOT a substitute for professional medical judgment and the model outputs are NOT a substitute for proper clinical diagnosis. Always consult with a certified physician. Despite strong benchmark performance, MedPsy-1.7B is a compact 1.7B-parameter language model, one of the smallest in its class, and will make errors. Quantization may further amplify rare-case failure modes that are not captured by aggregate benchmark numbers. Its small size makes it particularly susceptible to mistakes on complex, multi-step clinical reasoning tasks. Medical AI systems can produce outputs that appear confident and authoritative while being factually incorrect, incomplete, or clinically inappropriate.

Known limitations include:

  • Hallucinations: The model may generate plausible-sounding but incorrect medical information.
  • Quantization artifacts: Quantized models can occasionally produce subtly degraded outputs (rare-token drops, less stable formatting on long generations) that aggregate benchmarks may not capture. Effects grow at lower bit counts; we strongly recommend 4-bit (Q4_K_M, IQ4_NL, or IQ4_XS) or higher for production medical deployment, and avoid 3-bit variants (IQ3_M, IQ3_XXS) for medical use at this model size - the degradation is large enough to materially increase hallucination risk.
  • Compact model trade-offs: At 1.7B parameters, the model has inherently less capacity than larger models. It may struggle with rare conditions, complex multi-step reasoning, or nuanced clinical scenarios that require deep domain knowledge.
  • English only: The model was trained and evaluated primarily in English. Performance in other languages is not validated.
  • Text only: This model processes text inputs only. It cannot interpret medical images, lab results in non-text formats, or other modalities.
  • No real-time knowledge: The model's knowledge has a training data cutoff and does not reflect the latest medical guidelines, drug approvals, or clinical evidence.
  • Bias in training data: As with any model trained on synthetic and public medical data, biases in the source material may propagate to model outputs. Developers should validate performance across diverse patient populations, demographics, and clinical contexts.
  • Not designed for emergencies: This model should never be used as the sole decision-making tool in emergency or life-threatening situations.

Safety Recommendations

When integrating this model into any application:

  1. Always include visible disclaimers informing users that outputs are AI-generated and not a substitute for professional medical advice
  2. Do not use for direct clinical diagnosis or treatment without oversight by qualified healthcare professionals
  3. Monitor for harmful outputs and implement appropriate safety filters in production systems

Citation

@article{medpsy2026,
  title={MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices},
  author={Vitabile, Davide and Buffa, Alexandro and Nambiar, Akshay and Nazir, Amril},
  year={2026},
  url={https://huggingface.co/blog/qvac/medpsy},
  institution={Tether AI Research}
}

Copyright

We will take appropriate actions in response to notices of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content.

Licensing

This model, which was trained as described in the MedPsy Technical Report, is licensed by Tether Data, S.A. de C.V. under the Apache 2.0 license for research and educational purposes. As described above, this model is a version of Qwen3-1.7B, which is also available under the Apache 2.0 license.

As described above, a subset of the Genesis I and Genesis II datasets was used by the Baichuan-M3-235B model, which is itself available under the Apache 2.0 license, to generate synthetic data for training this model. The Genesis I dataset is made available under the CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0) license. The Genesis II dataset is also made available under the CC-BY-NC 4.0 license.
