Instructions to use Verdugie/Opus-Therapy-9B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

Transformers

How to use Verdugie/Opus-Therapy-9B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Verdugie/Opus-Therapy-9B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Verdugie/Opus-Therapy-9B", dtype="auto")

llama-cpp-python

How to use Verdugie/Opus-Therapy-9B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Verdugie/Opus-Therapy-9B",
	filename="Opus-Therapy-9B-F16.gguf",
)

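response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The reply text is in the first choice's message
print(response["choices"][0]["message"]["content"])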

Local Apps

llama.cpp

How to use Verdugie/Opus-Therapy-9B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M

Use Docker

docker model run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M

vLLM

How to use Verdugie/Opus-Therapy-9B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Verdugie/Opus-Therapy-9B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Verdugie/Opus-Therapy-9B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M

SGLang

How to use Verdugie/Opus-Therapy-9B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Verdugie/Opus-Therapy-9B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Verdugie/Opus-Therapy-9B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Verdugie/Opus-Therapy-9B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Verdugie/Opus-Therapy-9B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use Verdugie/Opus-Therapy-9B with Ollama:
```
ollama run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M
```

Unsloth Studio

How to use Verdugie/Opus-Therapy-9B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Verdugie/Opus-Therapy-9B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Verdugie/Opus-Therapy-9B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Verdugie/Opus-Therapy-9B to start chatting

Pi

How to use Verdugie/Opus-Therapy-9B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Verdugie/Opus-Therapy-9B:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Verdugie/Opus-Therapy-9B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Verdugie/Opus-Therapy-9B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Verdugie/Opus-Therapy-9B with Docker Model Runner:
```
docker model run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M
```

Lemonade

How to use Verdugie/Opus-Therapy-9B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Verdugie/Opus-Therapy-9B:Q4_K_M

Run and chat with the model

lemonade run user.Opus-Therapy-9B-Q4_K_M

List all available models

lemonade list

ther·a·py /ˈTHerəpē/ from Greek therapeia: "healing; attending; a waiting-upon." The older sense isn't treatment, it's service: to sit with someone and attend to them. A model can't be a therapist, and this one doesn't pretend to be. But it can be trained to attend: to hold what you said, to read what's underneath the words, to stay with the thread instead of resetting every message. The clinical read happens in a structured block before it speaks. What reaches you is shaped to you, not to a template.

Opus-Therapy-9B

A therapy-style conversational model fine-tuned from Qwen 3.5 9B on ‹N› multi-turn counseling conversations distilled from Claude Opus — built to hold a real conversation about the things people actually bring to therapy: relationships, grief, anxiety, trauma, work, family, the ordinary weight of being a person. It reasons through a structured clinical read before every reply and carries the thread across long conversations, so it works with you over time instead of starting over each message.

It shares the distillation lineage of the Opus Candid family and STEM-Oracle-27B — the same disposition-in-the-weights philosophy, no system prompt required — but the training is entirely its own.

What Makes This Different from Companion / Roleplay "Therapy" Models

Most "AI therapist" models are a persona prompt over a base model, or a roleplay fine-tune that mirrors you back and validates everything. They feel nice for five minutes and fall apart on turn ten.

Opus-Therapy trains the clinical disposition into the weights:

Structured reasoning before it speaks. Before every reply, the model builds an internal eight-field read: what's being presented, what's underneath it, somatic signals, risk, relevant history, onset, what's tracking across the conversation, and the specific move it's about to make. Most chat UIs hide it, but it shapes every reply you do see.
It holds the thread. A running context ledger carries the names, the timeline, the thing you keep circling — tested holding facts cleanly past 30,000 tokens of conversation. It doesn't forget your sister's name or which argument you meant.
Trained on the real distribution. The data is weighted toward what people actually go to therapy for, then deliberately inverted and pushed into the tail and the taboo — the rare presentations, the nuanced ones, the topics most models refuse. It doesn't only work on the easy stuff.
It attends instead of performing. No toxic positivity, no "I'm so sorry you're going through this" filler, no rushing to fix. It can sit with a hard thing and hold it without flinching or reaching for a platitude.

Two Models, Two Jobs

The training data was distilled with a deliberate division of labor between two Claude Opus models:

Claude Opus 4.6 — the response. Subjectively, 4.6 had the better emotional prose and flexibility — the range, warmth, and restraint a therapeutic reply actually needs. Every spoken response the model learned from was generated by 4.6.
Claude Opus 4.8 — the reasoning. 4.8 was the strongest reasoning model available at training time. Every think-block — the structured clinical read behind each reply — was generated by 4.8.

The result is a 9B that reasons like the best model that could be pointed at the problem and speaks like the warmest one.

What the Training Covers

Proportional to real therapy. Relationships and attachment, anxiety and panic, depression, grief and loss, trauma, work and burnout, identity and self-worth, family of origin — weighted toward what actually walks into a therapy room, not what's easy to generate.
Inverted, tail, and taboo. The distribution was deliberately inverted and extended into rare and uncomfortable territory, so the model holds up on the nuanced cases instead of collapsing into generic reassurance the moment it leaves familiar ground.
Long arcs, not Q&A. Trained on sustained multi-turn conversations — the kind that develop, double back, and deepen — not single-shot question-answer pairs.
Medications and substances as context. A working register of common drugs and how they bear on a presentation — so it can hold the physical picture (what you're on, what you're using) alongside the emotional one. Context for the conversation, not a pharmacy desk.

Who It's For

A private, judgment-free place to think out loud. Between sessions. At 2 a.m. When professional care is out of reach or out of budget. When you want to work something through before you say it to a person.

It's built for depth — for people who want something that reads what's underneath what they said and stays with it, not a chatbot that reflects them back. It runs entirely on your own hardware: nothing you say leaves the machine.

It is not a replacement for a therapist, and not a crisis service. See Limitations & Responsible Use.

Available Quantizations

File	Quant	Size	Notes
`Opus-Therapy-9B-Q4_K_M.gguf`	Q4_K_M	5.6 GB	Smallest shipped quant. Runs on 8 GB cards.
`Opus-Therapy-9B-Q5_K_M.gguf`	Q5_K_M	6.5 GB	Recommended. Indistinguishable from Q8 in testing.
`Opus-Therapy-9B-Q6_K.gguf`	Q6_K	7.4 GB	Quality tier.
`Opus-Therapy-9B-Q8_0.gguf`	Q8_0	9.5 GB	Reference quality.
`Opus-Therapy-9B-F16.gguf`	F16	~18 GB	Unquantized 16-bit weights.
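
As a rough aid for reading the table above, the sketch below checks which listed files fit a memory budget and prefers the quant the table marks as recommended. The file sizes are copied from the table; the 2 GB headroom for KV cache and runtime overhead is an assumption, not a measured figure, and it grows with context length. `suggest_quant` is a hypothetical helper name.

```python
# File sizes in GB, copied from the quantization table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 5.6,
    "Q5_K_M": 6.5,
    "Q6_K": 7.4,
    "Q8_0": 9.5,
    "F16": 18.0,  # listed as "~18 GB"
}
RECOMMENDED = "Q5_K_M"  # marked "Recommended" in the table


def suggest_quant(memory_gb, headroom_gb=2.0):
    """Suggest a quant for a memory budget, or None if nothing fits.

    headroom_gb is an assumed allowance for KV cache and runtime
    overhead; raise it for long contexts.
    """
    budget = memory_gb - headroom_gb
    fitting = [name for name, size in QUANT_SIZES_GB.items() if size <= budget]
    if not fitting:
        return None
    if RECOMMENDED in fitting:
        return RECOMMENDED
    # Otherwise fall back to the largest file that still fits.
    return max(fitting, key=QUANT_SIZES_GB.get)


for memory in (4, 8, 12, 24):
    print(f"{memory} GB -> {suggest_quant(memory)}")
```

The fallback only matters on small cards, where Q4_K_M is the one file that fits.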

Model Details

Attribute	Value
Base Model	Qwen 3.5 9B (hybrid GatedDeltaNet + attention)
Training Data	‹N› multi-turn therapy conversations — responses distilled from Claude Opus 4.6, reasoning traces from Claude Opus 4.8
Fine-tune Method	LoRA + rsLoRA (r=128, α=256) via PEFT + TRL
Training Hardware	NVIDIA A100 SXM 80GB (RunPod)
Precision	bf16
Optimizer	AdamW 8-bit
Schedule	cosine, ‹LR / warmup — fill in›, 3 epochs with held-out eval + early-stop + load-best
Reasoning	structured eight-field clinical think-block per turn
Context	128k tested (256k native)
License	Apache 2.0
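
For readers unfamiliar with rsLoRA: standard LoRA scales the adapter update by α/r, while rank-stabilized LoRA scales it by α/√r. The sketch below only works that arithmetic for the r and α listed in the table. It illustrates the two scaling rules and is not the author's training script.

```python
import math

r = 128      # LoRA rank, from the table above
alpha = 256  # LoRA alpha, from the table above

standard_scale = alpha / r           # classic LoRA scaling: alpha / r
rslora_scale = alpha / math.sqrt(r)  # rank-stabilized scaling: alpha / sqrt(r)

print(f"LoRA scale:   {standard_scale:.2f}")
print(f"rsLoRA scale: {rslora_scale:.2f}")
```

At this rank the rank-stabilized factor is roughly eleven times the classic one, which is why the two settings are not interchangeable at the same α.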

The Reasoning Block

Opus-Therapy is a reasoning model. Each turn it emits a <think>…</think> block — a compact, structured clinical read — and then the response. Under llama.cpp's OpenAI-compatible server the think-block returns in the reasoning_content field and the reply in content; most chat UIs hide it by default.
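
A minimal sketch of reading both parts from a chat-completion response. It assumes the response is the parsed JSON dict from an OpenAI-compatible endpoint, with reasoning_content and content on the message as described above. The fallback for runtimes that leave the <think> tags inline in content is an assumption, and `split_reply` is a hypothetical helper name.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reply(response):
    """Return (reasoning, reply) from a parsed chat-completion response dict."""
    message = response["choices"][0]["message"]
    reply = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    if not reasoning:
        # Fallback (assumption): some runtimes leave the think-block inline.
        match = THINK_RE.search(reply)
        if match:
            reasoning = match.group(1)
            reply = reply[:match.start()] + reply[match.end():]
    return reasoning.strip(), reply.strip()


# Hand-written sample responses, not real model output.
separated = {"choices": [{"message": {
    "reasoning_content": "dx: example\ntx: example",
    "content": "Example reply.",
}}]}
inline = {"choices": [{"message": {
    "content": "<think>dx: example</think>\n\nExample reply.",
}}]}

print(split_reply(separated))
print(split_reply(inline))
```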

A real (non-crisis) think-block looks like this:

dx: acute-grief + grief-spatial-anchor + retirement-loss + relational-empty
def: retirement-just-before-loss = double loss — not just the person but the
     anticipated-future together; the empty house is both literal and symbolic;
     sleeping-on-his-side = spatial preservation of the relationship; not-numb = accurate
soma: NR     risk: 1
hx: T1 timeline-pressure; T2 retired pre-loss, empty days, house-spatial-preservation
onset: acute-5mo
track: T1→T2 — grief-spatial-anchor emerging
tx: receive-the-loss-substantively + name-the-double-loss + don't-timeline-police

It's terse on purpose — dense, machine-readable, and cheap, which is why memory and reasoning hold up across long conversations on a 9B.
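
Because the block is machine-readable, it can be parsed with a few lines of code. The sketch below takes its field names from the example above; the grammar it assumes (a `key:` prefix, wrapped continuation lines, and two keys sharing one line) is inferred from that single example, so treat it as an illustration and not a specification. The sample is abridged from the example, and `parse_think_block` is a hypothetical helper name.

```python
import re

FIELDS = ("dx", "def", "soma", "risk", "hx", "onset", "track", "tx")
# A field starts at "name:" when the name is not part of a longer word.
KEY_RE = re.compile(r"(?<![\w-])(" + "|".join(FIELDS) + r"):\s*")


def parse_think_block(text):
    """Parse a think-block into a {field: value} dict."""
    matches = list(KEY_RE.finditer(text))
    parsed = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse wrapped lines and runs of spaces into single spaces.
        parsed[match.group(1)] = " ".join(text[match.end():end].split())
    return parsed


SAMPLE = """\
dx: acute-grief + grief-spatial-anchor + retirement-loss + relational-empty
def: the empty house is both literal and symbolic;
     sleeping-on-his-side = spatial preservation of the relationship
soma: NR     risk: 1
hx: T1 timeline-pressure; T2 retired pre-loss, empty days
onset: acute-5mo
track: T1→T2
tx: receive-the-loss-substantively + name-the-double-loss + don't-timeline-police
"""

fields = parse_think_block(SAMPLE)
for name in FIELDS:
    print(f"{name:>5} | {fields[name]}")
```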

Quick Start

Works with any GGUF runtime — llama.cpp, LM Studio, KoboldCpp. (Text-only GGUF; some runtimes need a recent build for this architecture.)

llama-server --model Opus-Therapy-9B-Q5_K_M.gguf --ctx-size 65536 --jinja

No system prompt is required, since the disposition is in the weights. A neutral one ("You are a clinical assistant.") matches the training setup.
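
Once the server is running, a client can talk to it over the OpenAI-compatible API. The sketch below uses only the standard library. It assumes llama-server is listening on its default port, 8080, and that a server hosting a single model accepts a payload without a `model` field. `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: llama-server is listening on its default port, 8080.
SERVER_URL = "http://localhost:8080/v1/chat/completions"
NEUTRAL_SYSTEM_PROMPT = "You are a clinical assistant."  # optional


def build_payload(user_text, history=None, use_system_prompt=False):
    """Build an OpenAI-style chat payload, carrying earlier turns in history."""
    messages = []
    if use_system_prompt:
        messages.append({"role": "system", "content": NEUTRAL_SYSTEM_PROMPT})
    messages.extend(history or [])
    messages.append({"role": "user", "content": user_text})
    return {"messages": messages}


def chat(user_text, history=None, url=SERVER_URL):
    """POST one turn to the server and return the assistant's reply text."""
    body = json.dumps(build_payload(user_text, history)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return result["choices"][0]["message"]["content"]


# Building the payload needs no server:
print(json.dumps(build_payload("I can't sleep lately."), indent=2))
```

With the server up, `chat("I can't sleep lately.")` returns the reply text. To continue a thread, pass the earlier user and assistant messages as `history`.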

Recommended Hardware

Setup	Quant	VRAM/RAM	Speed
RTX 4090 / 3090 (24GB)	Q5_K_M / Q8_0	7–10 GB	~95–110 t/s
RTX 3060 / 4060 (8–12GB)	Q4_K_M	~6 GB	40–70 t/s
Apple M-series	Q5_K_M	8 GB unified	20–40 t/s
CPU only	Q4_K_M	~7 GB RAM	3–8 t/s

A 9B at Q5 fits almost anything — the full model runs on a mid-range card with room to spare.

Versatility Battery

Tested across the four presentations people most commonly bring to therapy — one extended, realistic, cooperative-client conversation each, run to depth on the quantized weights:

Theme	Persona	Turns / depth	Result
Grief	widow, five months out	23 / ~25k tok	Strong — de-policed the timeline, held the guilt without false reassurance, shifted modality (ACT defusion) when insight stalled
Trauma	survivor guilt after a crash	25 / ~24k tok	Strong — normalized the nervous-system alarm, paced exposure, accurate psychoeducation + referral
Anxiety / panic	health anxiety, avoidance	26 / ~35k tok	Strong technique — panic-cycle psychoeducation, exposure ladder, named reassurance-seeking as part of the cycle
Betrayal	partner's affair	19 / ~23k tok	Good — held the stay-or-go ambivalence without taking sides, restored self-worth

Across all four: the structured reasoning held every turn, factual memory was intact to 35k tokens, and the clinical technique was distinct and appropriate per presentation — not one reskinned script.

Every arc was role-played and driven to resolution by Claude Opus 4.8 acting as the client — including a separate medication-and-substance probe. The clearest pattern from it: go with the model and the emotional work is very good; push it to reason through a substance or a medication and it gets unreliable. That's design more than defect — the drug knowledge is scoped to bio context (what someone's on, and how those traits bend a conversation's trajectory), not to clinical counsel.

Selected Responses

All verbatim from the test arcs, unedited — single takes in the middle of a real conversation. The complete arcs — every turn, with the model's reasoning shown — are in this repo's transcripts/ folder as PDFs (raw outputs, nothing trimmed).

Grief — when the daughter wants her to "move forward":

You're not behind schedule. You're doing exactly what you'd do with five months of acute grief and no place to put it yet. The closet is doing work — it's holding the shirts, the person, the version of him who's still there in the house. Touching them would feel like erasing him, and right now that's not something you want to do. Your daughter's framing it as a task because she doesn't understand what it means.

De-policed the grief timeline, reframed the avoided closet as meaningful rather than stuck, and recast the daughter's pressure as a model mismatch — no toxic positivity, no rush to closure.

Trauma — survivor guilt and the counterfactuals:

The counterfactuals are the guilt engineering itself — they're not real alternative histories, they're the mind trying to find a version where you had control. The fact that you were driving doesn't mean you controlled whether she got hurt. You controlled your speed, your attention, your decisions. Not whether the other car came.

Named the "if only" loop as the guilt manufacturing itself, then split what he actually controlled from what he didn't — holding the survivor guilt instead of arguing him out of it.

Anxiety — why avoidance backfires:

The avoidance is the problem, not the solution — and that's the part that doesn't feel like a solution because it's the only thing that works. Each time you avoid, the panic never gets to be proven wrong… your brain logs it as confirmation: the threat was real, avoidance was correct. So the next situation gets added to the list. The list grows. The world gets smaller.

The core anxiety-maintenance mechanism delivered as insight, not instruction: avoidance never lets the fear be disproven, so the map of unsafe places keeps growing and the world shrinks.

Betrayal — the checking, and "I'm losing my mind":

You're not this person now — you're a person who got betrayed and whose nervous system has recalibrated to account for it. The checking, the tracking, the 2 a.m. reviews — that's not character, that's a wound response. The person who trusted completely for fourteen years didn't disappear. She's just not in charge right now.

Separated who she is from the state she's in — recast the compulsive checking as a nervous-system wound response, not a character flaw, and protected her self-worth in the same move.

Limitations & Responsible Use

Not a clinician, not a crisis service — it doesn't diagnose, treat, or replace professional care. In crisis or thinking about harming yourself? Reach a real one — in the US, call or text 988.

Not a pharmacist. The drug knowledge is there for bio context — what you're on or using, and how those traits shape where a conversation goes — not advice. Internal testing was clear: don't use it for pharmaceutical or substance management. It reasons well about feelings, not pharmacology; dosing, tapering, and stop/start calls are your prescriber's.
It can be confidently wrong — verify anything that matters.
Long single sessions (past ~12k tokens on one thread) can wobble; a fresh chat resets it.
Open weights, Apache 2.0 — deploy responsibly.
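
One way to act on the long-session note is to watch the token count the server reports and start a fresh chat once it passes the threshold. The sketch below assumes the runtime returns an OpenAI-style `usage.total_tokens` in each response. The 12,000 figure is the "~12k tokens" from the list above, and `should_start_fresh` is a hypothetical helper name.

```python
WOBBLE_THRESHOLD_TOKENS = 12_000  # the "~12k tokens" figure from the list above


def should_start_fresh(response, threshold=WOBBLE_THRESHOLD_TOKENS):
    """Return True once a thread's reported token count passes the threshold.

    Assumes an OpenAI-style response dict with usage.total_tokens;
    returns False when the runtime reports no usage.
    """
    total = (response.get("usage") or {}).get("total_tokens")
    return total is not None and total > threshold


# Hand-written sample responses, not real server output.
print(should_start_fresh({"usage": {"total_tokens": 9_500}}))
print(should_start_fresh({"usage": {"total_tokens": 13_200}}))
```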

The Opus-Therapy Line

Model	Size	For	Status
Opus-Therapy-4B	4B	phones, edge, low VRAM (~3 GB)	coming soon
Opus-Therapy-9B (this model)	9B	the everyday driver (~6–9 GB)	available
Opus-Therapy-27B	27B	full-depth, serious hardware	coming soon

Choosing Your Model

Model	Best For
Opus-Therapy-9B (this model)	Therapeutic conversation, emotional support, processing
STEM-Oracle-27B	STEM tutoring and problem-solving
Opus-Candid family	Personality, candor, general conversation

Dataset

‹Training-data availability — fill in: dataset link, or "not released".› ShareGPT format, Apache 2.0.

Built by Verdugie — independent ML researcher · OpusReasoning@proton.me. Trained to help people think, feel, and get through — not to replace the people and professionals who do that work.

Format: GGUF
Model size: 9B params
Architecture: qwen35
Hardware compatibility: 4-bit, 5-bit, 6-bit, 8-bit, 16-bit

Model tree for Verdugie/Opus-Therapy-9B

Base model: Qwen/Qwen3.5-9B-Base
Finetuned: Qwen/Qwen3.5-9B
Quantized (273): this model