Instructions for using qvac/MedPsy-1.7B-GGUF with libraries and local apps.

Libraries

How to use qvac/MedPsy-1.7B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="qvac/MedPsy-1.7B-GGUF",
	filename="medpsy-1.7b-bf16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
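The call above returns an OpenAI-style chat-completion dictionary rather than a plain string. Below is a minimal sketch of reading the reply text out of it, assuming the usual `choices[0]["message"]["content"]` layout. `extract_reply` is a hypothetical helper name, not part of llama-cpp-python, and the sample dictionary is hand-written rather than real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat-completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message dict with "role" and "content".
    return choices[0]["message"]["content"] or ""


# Hand-written response in the same shape, used here instead of a real model call.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample_response))
```

With a loaded model, the same helper would be applied to the return value of `llm.create_chat_completion(...)`.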

Local Apps

llama.cpp

How to use qvac/MedPsy-1.7B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/qvac/MedPsy-1.7B-GGUF:Q4_K_M


vLLM

How to use qvac/MedPsy-1.7B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "qvac/MedPsy-1.7B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "qvac/MedPsy-1.7B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
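The same request can be issued from Python with only the standard library. This is a sketch under the assumption that the server from the step above is listening on `http://localhost:8000`. `build_chat_request` and `send` are hypothetical helper names introduced here for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


request = build_chat_request(
    "http://localhost:8000", "qvac/MedPsy-1.7B-GGUF", "What is the capital of France?"
)
# reply = send(request)  # uncomment once the server is up
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.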

Use Docker

docker model run hf.co/qvac/MedPsy-1.7B-GGUF:Q4_K_M

Ollama
How to use qvac/MedPsy-1.7B-GGUF with Ollama:
```
ollama run hf.co/qvac/MedPsy-1.7B-GGUF:Q4_K_M
```

Unsloth Studio

How to use qvac/MedPsy-1.7B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for qvac/MedPsy-1.7B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for qvac/MedPsy-1.7B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for qvac/MedPsy-1.7B-GGUF to start chatting

Pi

How to use qvac/MedPsy-1.7B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "qvac/MedPsy-1.7B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use qvac/MedPsy-1.7B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf qvac/MedPsy-1.7B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default qvac/MedPsy-1.7B-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use qvac/MedPsy-1.7B-GGUF with Docker Model Runner:
```
docker model run hf.co/qvac/MedPsy-1.7B-GGUF:Q4_K_M
```

Lemonade

How to use qvac/MedPsy-1.7B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull qvac/MedPsy-1.7B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.MedPsy-1.7B-GGUF-Q4_K_M

List all available models

lemonade list

MedPsy-1.7B-GGUF

MedPsy-1.7B-GGUF provides GGUF weights of MedPsy-1.7B for fast, fully on-device inference via llama.cpp and the QVAC SDK. An unquantized BF16 GGUF file (about 4.07 GB) is included alongside seven quantization formats, ranging from near-lossless 8-bit (about 2.16 GB) through a high-quality 5-bit option (about 1.47 GB) down to ultra-compact 3-bit (about 0.89 GB), making the same model deployable across everything from laptops to entry-level smartphones.


Developed by	Tether AI Research
Model type	Text-only causal language model (decoder-only transformer), GGUF quantized
Base (BF16) model	MedPsy-1.7B
Backbone	Qwen3-1.7B (Thinking)
Language	English
License	Apache 2.0
Quantization tool	llama.cpp
Technical report	MedPsy Technical Report
Collection	MedPsy on Hugging Face
All MedPsy variants	MedPsy-4B · MedPsy-1.7B · MedPsy-4B-GGUF · MedPsy-1.7B-GGUF

Available Files

All published files are produced with llama.cpp. The BF16 GGUF file is unquantized: no quantization is applied. We have not separately re-evaluated the BF16 GGUF with llama.cpp; because it preserves the same BF16 tensor precision as the source checkpoint, performance is expected to match the BF16 source model evaluated with vLLM, aside from small backend or runtime differences. Q8_0 does not use imatrix calibration (we verified that imatrix provided no measurable benefit at 8-bit). All sub-8-bit variants use importance-matrix (imatrix) calibration, which consistently reduces quality degradation. See the MedPsy Technical Report (Section 4.7) for the full quantization methodology, including the K-quants vs I-quants comparison and the per-bit-count imatrix ablation.

File	Format	Imatrix	Size	Δ Size	Δ AVG (pts)	Δ AVG (rel %)	Recommended For
`medpsy-1.7b-bf16.gguf`	BF16	n/a	4.07 GB	0%	≈0.00	≈0.00%	Unquantized GGUF (same performance expected)
`medpsy-1.7b-q8_0.gguf`	Q8_0	no (not needed)	2.16 GB	-47%	0.00	0.00%	Best quality, near-lossless
`medpsy-1.7b-q5_k_m-imat.gguf`	Q5_K_M	yes	1.47 GB	-64%	-0.02	-0.03%	Recommended high-quality 5-bit option
`medpsy-1.7b-q4_k_m-imat.gguf`	Q4_K_M	yes	1.28 GB	-69%	-0.73	-1.10%	Recommended for smartphone (best size/quality trade-off)
`medpsy-1.7b-iq4_nl-imat.gguf`	IQ4_NL	yes	1.23 GB	-70%	-1.70	-2.56%	Alternative 4-bit (slightly worse than Q4_K_M)
`medpsy-1.7b-iq4_xs-imat.gguf`	IQ4_XS	yes	1.18 GB	-71%	-1.79	-2.69%	Smaller 4-bit alternative
`medpsy-1.7b-iq3_m-imat.gguf`	IQ3_M	yes	1.03 GB	-75%	-3.58	-5.40%	⚠ Notable degradation, expect more hallucinations - not recommended for medical use
`medpsy-1.7b-iq3_xxs-imat.gguf`	IQ3_XXS	yes	0.89 GB	-78%	-12.46	-18.78%	⚠ Severe degradation - not recommended

Two ways to read quality loss. Δ AVG (pts) is the absolute change in AVG Score vs the BF16/vLLM source-model baseline, i.e. the raw points lost. Δ AVG (rel %) is the same change expressed as a fraction of the baseline ((variant − baseline) / baseline), so negative values indicate a loss. They convey complementary information: the absolute delta is the easiest "how much score did I actually lose?" reading, while the relative delta normalizes by baseline so quality degradation is comparable across models with different starting scores. AVG Score = mean of HealthBench Overall and Closed-Ended Average.
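As a worked example, the two deltas can be recomputed from the published AVG Scores. The sketch below uses the BF16 baseline (66.31) and Q4_K_M (65.58) values from the benchmark table. The function names are illustrative only.

```python
def delta_points(baseline: float, variant: float) -> float:
    """Absolute change in AVG Score (negative means the variant scores lower)."""
    return variant - baseline


def delta_relative_pct(baseline: float, variant: float) -> float:
    """Change as a percentage of the baseline score."""
    return (variant - baseline) / baseline * 100.0


BASELINE_AVG = 66.31  # MedPsy-1.7B, BF16, vLLM
Q4_K_M_AVG = 65.58

print(round(delta_points(BASELINE_AVG, Q4_K_M_AVG), 2))        # -0.73
print(round(delta_relative_pct(BASELINE_AVG, Q4_K_M_AVG), 2))  # -1.1
```

Recomputing from the rounded table values can differ from the published deltas by 0.01 for some rows, presumably because the published figures were derived from unrounded scores.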

Quick Recommendation

Your constraint	Choose
You want a llama.cpp-native unquantized file	BF16 - no quantization applied; same quality as the source BF16 checkpoint expected, GGUF format (4.07 GB)
You want the best possible quality at smaller size	Q8_0 - identical AVG Score to the BF16 baseline (66.31), 47% smaller (2.16 GB)
You want extra quality headroom over 4-bit	Q5_K_M (imatrix) - 64% smaller than BF16, only -0.02 pts (-0.03% rel) on AVG Score
You want the best size/quality trade-off (most users)	Q4_K_M (imatrix) - 69% smaller, only -0.73 pts (-1.10% rel) on AVG Score
You need the smallest 4-bit option	IQ4_XS (imatrix) - 71% smaller, -1.79 pts (-2.69% rel) (1.18 GB)
You need around 1 GB	IQ3_M (1.03 GB), but expect notable quality degradation (-3.58 pts, -5.40% rel) and increased hallucination risk - not recommended for medical use
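The table above can also be expressed as a small selection helper. This is a sketch, not an official tool: file names and sizes are copied from the Available Files table, the "largest file that fits" rule is an assumed heuristic (larger files score higher in the benchmarks), and `pick_variant` is a hypothetical name. File size is only a lower bound on runtime memory, since the context cache and runtime overhead come on top.

```python
from typing import Optional

# (file name, size in GB, acceptable for medical use), largest first.
VARIANTS = [
    ("medpsy-1.7b-bf16.gguf", 4.07, True),
    ("medpsy-1.7b-q8_0.gguf", 2.16, True),
    ("medpsy-1.7b-q5_k_m-imat.gguf", 1.47, True),
    ("medpsy-1.7b-q4_k_m-imat.gguf", 1.28, True),
    ("medpsy-1.7b-iq4_nl-imat.gguf", 1.23, True),
    ("medpsy-1.7b-iq4_xs-imat.gguf", 1.18, True),
    ("medpsy-1.7b-iq3_m-imat.gguf", 1.03, False),
    ("medpsy-1.7b-iq3_xxs-imat.gguf", 0.89, False),
]


def pick_variant(max_file_gb: float, medical_use: bool = True) -> Optional[str]:
    """Return the largest file within the size budget, or None if nothing fits."""
    for name, size_gb, ok_for_medical in VARIANTS:
        if size_gb <= max_file_gb and (ok_for_medical or not medical_use):
            return name
    return None


print(pick_variant(1.3))  # medpsy-1.7b-q4_k_m-imat.gguf
print(pick_variant(1.0))  # None
```

Returning `None` for a 1.0 GB budget reflects the card's guidance that the 3-bit files are not recommended for medical use.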

Benchmark Results

The comparison uses the BF16 source model evaluated with vLLM as the reference baseline. Quantized GGUF variants were evaluated on the full closed-ended benchmark suite (7 medical benchmarks, averaged) and HealthBench (CompassJudger-2-32B-Instruct as judge), using the same benchmark protocol. The BF16 GGUF file is the unquantized GGUF export and has not been separately re-run with llama.cpp; since no quantization is applied, its performance is expected to match the BF16 source model aside from small backend or runtime differences. AVG Score is the average of HealthBench Overall and Closed-Ended Average.

Variant	Size (GB)	HealthBench	HB Hard	CE Avg	AVG Score	Δ AVG (pts)	Δ AVG (rel %)	Δ Size
MedPsy-1.7B (BF16, vLLM baseline)	4.07	70	54	62.62	66.31	0.00	0.00%	0%
Q8_0 ★	2.16	70	55	62.62	66.31	0.00	0.00%	-47%
Q5_K_M ★	1.47	70	55	62.58	66.29	-0.02	-0.03%	-64%
Q4_K_M ★	1.28	69	52	62.16	65.58	-0.73	-1.10%	-69%
IQ4_NL	1.23	69	51	60.22	64.61	-1.70	-2.56%	-70%
IQ4_XS	1.18	69	53	60.05	64.53	-1.79	-2.69%	-71%
IQ3_M ⚠	1.03	67	49	58.46	62.73	-3.58	-5.40%	-75%
IQ3_XXS ⚠	0.89	59	40	48.71	53.86	-12.46	-18.78%	-78%

HB = HealthBench; CE = Closed-Ended Average (7 medical benchmarks). Δ AVG (pts) is the absolute point change in AVG Score vs the BF16/vLLM source-model baseline (e.g. 65.58 - 66.31 = -0.73). Δ AVG (rel %) is the relative change as a fraction of the baseline (e.g. -0.73 / 66.31 = -1.10%). Δ Size is the relative file-size change vs BF16. ★ Recommended variants (best of class). ⚠ Not recommended for medical use due to notable (IQ3_M) or severe (IQ3_XXS) quality degradation that increases hallucination risk. AVG Score = (HealthBench Overall + Closed-Ended Average) / 2. Results averaged over 3 runs with generation parameters: temperature=0.6, top_k=20, top_p=0.95, max_output_tokens=16384. Quantized GGUF variants were evaluated with llama.cpp; the BF16 baseline was evaluated with vLLM. HealthBench evaluated using CompassJudger-2-32B-Instruct.
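The AVG Score and Δ Size columns follow directly from the definitions above. A short sketch using the Q4_K_M row (HealthBench 69, CE Avg 62.16, 1.28 GB against the 4.07 GB BF16 file); the function names are illustrative only.

```python
def avg_score(healthbench_overall: float, closed_ended_avg: float) -> float:
    """AVG Score = mean of HealthBench Overall and Closed-Ended Average."""
    return (healthbench_overall + closed_ended_avg) / 2


def size_change_pct(baseline_gb: float, variant_gb: float) -> float:
    """Relative file-size change vs the baseline file, in percent."""
    return (variant_gb - baseline_gb) / baseline_gb * 100


print(round(avg_score(69, 62.16), 2))      # 65.58
print(round(size_change_pct(4.07, 1.28)))  # -69
```

Because the HealthBench column is shown as whole numbers, recomputing AVG Score from the table can differ from the published value in the last digit for some rows.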

Key Findings

Q8_0 is effectively lossless: identical AVG Score (66.31) to the BF16 baseline at 47% smaller (2.16 GB vs 4.07 GB), with no need for imatrix calibration.
Q5_K_M is a recommended high-quality option: -0.02 pts (-0.03% relative) AVG Score at 64% smaller (1.47 GB), while matching Q8_0 on HealthBench and HealthBench Hard.
Q4_K_M is the sweet spot: only -0.73 pts AVG Score loss (-1.10% relative) for a 69% size reduction, comfortably fitting in a typical mobile app's memory budget.
Smallest safe option is IQ4_XS (1.18 GB, -1.79 pts / -2.69% relative): the smallest variant we recommend for medical use on this 1.7B model.
3-bit is too aggressive at 1.7B: IQ3_M loses 3.58 pts (-5.40% relative) and IQ3_XXS loses 12.46 pts (-18.78% relative). At this model size, 3-bit quantization noticeably degrades reasoning quality and increases the risk of hallucinations. We do not recommend either variant for medical applications - if you need an aggressively quantized model, prefer the 4B GGUF at IQ3_M (2.13 GB), which is far more robust at low bit counts.
Calibration matters at low bit counts: imatrix calibration improves the Q4_K_M Closed-Ended Average by +1.58 points; the effect grows at 3-bit quantizations.

How Quantized Variants Compare to Other Models

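Even the smallest starred (★) quantization, Q4_K_M at 1.28 GB, retains a substantial lead over the unquantized open-weight baselines in this size class: on Closed-Ended Average it is 12.21 points ahead of Qwen3-1.7B (Thinking) and 18.01 points ahead of LFM2.5-1.2B-Thinking, and on HealthBench it scores 69 against 53 and 49 respectively.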

Model	Size (GB)	Closed-Ended Avg	HealthBench
MedPsy-1.7B (BF16)	4.07	62.62	70
MedPsy-1.7B Q8_0 ★	2.16	62.62	70
MedPsy-1.7B Q5_K_M ★	1.47	62.58	70
MedPsy-1.7B Q4_K_M ★	1.28	62.16	69
Qwen3-1.7B (Thinking) (BF16, backbone)	4.07	49.95	53
LFM2.5-1.2B-Thinking (BF16)	2.6	44.15	49

Usage

llama.cpp

# Download the recommended file (Q4_K_M with imatrix - best size/quality for smartphone)
huggingface-cli download qvac/MedPsy-1.7B-GGUF medpsy-1.7b-q4_k_m-imat.gguf --local-dir .

# Run interactively
./llama-cli -m medpsy-1.7b-q4_k_m-imat.gguf \
    -p "What are the common symptoms and first-line treatments for community-acquired pneumonia?" \
    --temp 0.6 --top-k 20 --top-p 0.95 -n 1024
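For Python users, the same local file and sampling settings can be passed to llama-cpp-python. This is a sketch rather than a tested recipe: `ask` is a hypothetical helper, `n_ctx=4096` is an assumed context size (borrowed from the QVAC SDK example in this card), and the sampling values mirror the CLI flags above.

```python
# Sampling settings matching the CLI flags above (--temp, --top-k, --top-p, -n).
SAMPLING = {"temperature": 0.6, "top_k": 20, "top_p": 0.95, "max_tokens": 1024}


def ask(model_path: str, question: str) -> str:
    """Load a local GGUF file and answer one question (requires llama-cpp-python)."""
    # Imported lazily so this module can be loaded without the package installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)  # n_ctx is an assumption
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}], **SAMPLING
    )
    return result["choices"][0]["message"]["content"]


# Example call, once the file has been downloaded:
# print(ask("medpsy-1.7b-q4_k_m-imat.gguf", "What are the common symptoms of pneumonia?"))
```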

QVAC SDK

These GGUF files are designed for deployment through the QVAC SDK, enabling fully private on-device inference on smartphones, tablets, and edge devices. See the QVAC documentation for integration guides.


# 1. Project setup
mkdir medpsy && cd medpsy
npm init -y && npm pkg set type=module

# 2. Install SDK + matching Bare runtime binary
#    (swap linux-x64 for: linux-arm64 | darwin-arm64 | darwin-x64 | win32-x64 | win32-arm64)
npm i @qvac/sdk bare-runtime-linux-x64

# 3. Authenticate with Hugging Face (one-time) and download the recommended quant
hf auth login
hf download qvac/MedPsy-1.7B-GGUF medpsy-1.7b-q4_k_m-imat.gguf --local-dir ./models

# 4. Run inference (streamed, GPU)
node --input-type=module -e '
import { loadModel, completion, unloadModel, VERBOSITY } from "@qvac/sdk";
import { resolve } from "node:path";
const id = await loadModel({
  modelSrc: resolve("./models/medpsy-1.7b-q4_k_m-imat.gguf"),
  modelType: "llamacpp-completion",
  modelConfig: { device: "gpu", ctx_size: 4096, verbosity: VERBOSITY.ERROR },
});
const r = completion({
  modelId: id,
  history: [{ role: "user", content: "First-line treatment for community-acquired pneumonia in 2 sentences." }],
  stream: true,
  generationParams: { temp: 0.6, top_p: 0.95, top_k: 20, predict: 2048 },
});
for await (const t of r.tokenStream) process.stdout.write(t);
const s = await r.stats;
console.log(`\n[${s.tokensPerSecond.toFixed(2)} tok/s, TTFT ${s.timeToFirstToken.toFixed(0)}ms, ${s.backendDevice}]`);
await unloadModel({ modelId: id });
'

Notes:

If hf isn't installed: pip install --user "huggingface_hub[cli,hf_xet]" (or via your conda env).
MedPsy is a Qwen3-Thinking model: it emits a <think>...</think> block before the answer, so keep predict ≥ 1500.
For CPU-only boxes, set modelConfig.device: "cpu".
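Since the model produces its reasoning before the final answer, applications usually want to separate the two. The sketch below assumes the reasoning is wrapped in `<think>` and `</think>` tags, as is the convention for Qwen3 thinking models. `split_reasoning` is a hypothetical helper and the sample string is a placeholder, not model output.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    match = THINK_BLOCK.search(text)
    if match is None:
        if "<think>" in text:
            # Generation stopped mid-reasoning (token budget too small): no answer yet.
            return text.split("<think>", 1)[1].strip(), ""
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>step-by-step reasoning goes here</think>\nFinal answer goes here."
reasoning, answer = split_reasoning(raw)
print(answer)
```

An empty answer alongside non-empty reasoning is a sign that the token budget ran out before the closing tag, which is why a generous `predict` value matters.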

Use and Limitations

Intended Use

MedPsy-1.7B-GGUF is intended as a starting point for developers and researchers building downstream healthcare applications involving medical text on-device. Developers are expected to validate, adapt, and make meaningful modifications to the model for their specific use cases.

Appropriate use cases include:

On-device medical information retrieval for privacy-sensitive environments, including smartphones
Building developer tools and prototypes for health-related applications running on edge devices
Research on medical language understanding and reasoning under aggressive quantization

All of these uses should be accompanied by appropriate disclaimers.

Limitations

This model is NOT a substitute for professional medical judgment and the model outputs are NOT a substitute for proper clinical diagnosis. Always consult with a certified physician. Despite strong benchmark performance, MedPsy-1.7B is a compact 1.7B-parameter language model, one of the smallest in its class, and will make errors. Quantization may further amplify rare-case failure modes that are not captured by aggregate benchmark numbers. Its small size makes it particularly susceptible to mistakes on complex, multi-step clinical reasoning tasks. Medical AI systems can produce outputs that appear confident and authoritative while being factually incorrect, incomplete, or clinically inappropriate.

Known limitations include:

Hallucinations: The model may generate plausible-sounding but incorrect medical information.
Quantization artifacts: Quantized models can occasionally produce subtly degraded outputs (rare-token drops, less stable formatting on long generations) that aggregate benchmarks may not capture. Effects grow at lower bit counts; we strongly recommend 4-bit (Q4_K_M, IQ4_NL, or IQ4_XS) or higher for production medical deployment, and we recommend avoiding the 3-bit variants (IQ3_M, IQ3_XXS) for medical use at this model size - the degradation is large enough to materially increase hallucination risk.
Compact model trade-offs: At 1.7B parameters, the model has inherently less capacity than larger models. It may struggle with rare conditions, complex multi-step reasoning, or nuanced clinical scenarios that require deep domain knowledge.
English only: The model was trained and evaluated primarily in English. Performance in other languages is not validated.
Text only: This model processes text inputs only. It cannot interpret medical images, lab results in non-text formats, or other modalities.
No real-time knowledge: The model's knowledge has a training data cutoff and does not reflect the latest medical guidelines, drug approvals, or clinical evidence.
Bias in training data: As with any model trained on synthetic and public medical data, biases in the source material may propagate to model outputs. Developers should validate performance across diverse patient populations, demographics, and clinical contexts.
Not designed for emergencies: This model should never be used as the sole decision-making tool in emergency or life-threatening situations.

Safety Recommendations

When integrating this model into any application:

Always include visible disclaimers informing users that outputs are AI-generated and not a substitute for professional medical advice
Do not use for direct clinical diagnosis or treatment without oversight by qualified healthcare professionals
Monitor for harmful outputs and implement appropriate safety filters in production systems
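The first recommendation can be enforced in code so that no response reaches the user without a disclaimer. This is a minimal sketch: the disclaimer wording and the `with_disclaimer` helper are illustrative assumptions and should be adapted to the application and its legal requirements.

```python
DISCLAIMER = (
    "This response is AI-generated and is not a substitute for "
    "professional medical advice."
)


def with_disclaimer(model_output: str) -> str:
    """Append the disclaimer once, even if the function is applied twice."""
    text = model_output.rstrip()
    if text.endswith(DISCLAIMER):
        return text
    return f"{text}\n\n{DISCLAIMER}"


print(with_disclaimer("Model answer goes here."))
```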

Related Resources

MedPsy Collection: All MedPsy models, datasets, and resources in one place
MedPsy-1.7B (BF16): Full-precision source model
MedPsy-4B-GGUF: Larger GGUF sibling for higher-quality edge deployment
MedPsy Technical Report: Full quantization methodology and results (Section 4.7)
QVAC SDK: On-device AI deployment framework
llama.cpp: Inference engine and quantization toolchain

Citation

@article{medpsy2026,
  title={MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices},
  author={Vitabile, Davide and Buffa, Alexandro and Nambiar, Akshay and Nazir, Amril},
  year={2026},
  url={https://huggingface.co/blog/qvac/medpsy},
  institution={Tether AI Research}
}

Copyright

We will take appropriate actions in response to notices of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content.

Licensing

This model, which was trained as described in the MedPsy Technical Report, is licensed by Tether Data, S.A. de C.V. under the Apache 2.0 license for research and educational purposes. As described above, this model is a version of Qwen3-1.7B, which is also available under the Apache 2.0 license.

As described above, a subset of the Genesis I and Genesis II datasets was used by the Baichuan-M3-235B model, which itself is also available under the Apache 2.0 license, to generate synthetic data for training this model. The Genesis I dataset is made available under the CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0) license. The Genesis II dataset is also made available under the CC-BY-NC 4.0 license.