Instructions for using Verdugie/Opus-Therapy-9B with libraries, inference servers, notebooks, and local apps.
- Libraries
- Transformers
How to use Verdugie/Opus-Therapy-9B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Verdugie/Opus-Therapy-9B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Verdugie/Opus-Therapy-9B", dtype="auto")
```
- llama-cpp-python
How to use Verdugie/Opus-Therapy-9B with llama-cpp-python:
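```python
# !pip install llama-cpp-python
from llama_cpp import Llama

# Q5_K_M is the recommended quant; any file from the Available Quantizations table works here.
llm = Llama.from_pretrained(
    repo_id="Verdugie/Opus-Therapy-9B",
    filename="Opus-Therapy-9B-Q5_K_M.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
print(response["choices"][0]["message"]["content"])
```
- Notebooks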
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Verdugie/Opus-Therapy-9B with llama.cpp:
Install from brew
```bash
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M
```
Install from WinGet (Windows)
```bash
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M
```
Use pre-built binary
```bash
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M
```
Build from source code
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Verdugie/Opus-Therapy-9B:Q4_K_M
```
Use Docker
```bash
docker model run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M
```
- LM Studio
- Jan
- vLLM
How to use Verdugie/Opus-Therapy-9B with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Verdugie/Opus-Therapy-9B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Verdugie/Opus-Therapy-9B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```bash
docker model run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M
```
- SGLang
How to use Verdugie/Opus-Therapy-9B with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Verdugie/Opus-Therapy-9B" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Verdugie/Opus-Therapy-9B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Verdugie/Opus-Therapy-9B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Verdugie/Opus-Therapy-9B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Ollama
How to use Verdugie/Opus-Therapy-9B with Ollama:
```bash
ollama run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M
```
- Unsloth Studio
How to use Verdugie/Opus-Therapy-9B with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Verdugie/Opus-Therapy-9B to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Verdugie/Opus-Therapy-9B to start chatting
```
Using Hugging Face Spaces for Unsloth
No setup required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for Verdugie/Opus-Therapy-9B to start chatting.
- Pi
How to use Verdugie/Opus-Therapy-9B with Pi:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "Verdugie/Opus-Therapy-9B:Q4_K_M" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use Verdugie/Opus-Therapy-9B with Hermes Agent:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Verdugie/Opus-Therapy-9B:Q4_K_M
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Verdugie/Opus-Therapy-9B:Q4_K_M
```
Run Hermes
```bash
hermes
```
- Docker Model Runner
How to use Verdugie/Opus-Therapy-9B with Docker Model Runner:
```bash
docker model run hf.co/Verdugie/Opus-Therapy-9B:Q4_K_M
```
- Lemonade
How to use Verdugie/Opus-Therapy-9B with Lemonade:
Pull the model
```bash
# Download Lemonade from https://lemonade-server.ai/
lemonade pull Verdugie/Opus-Therapy-9B:Q4_K_M
```
Run and chat with the model
```bash
lemonade run user.Opus-Therapy-9B-Q4_K_M
```
List all available models
```bash
lemonade list
```
ther·a·py /ˈTHerəpē/, from Greek therapeia: "healing; attending; a waiting-upon." The older sense isn't treatment, it's service: to sit with someone and attend to them. A model can't be a therapist, and this one doesn't pretend to be. But it can be trained to attend: to hold what you said, to read what's underneath the words, to stay with the thread instead of resetting every message. The clinical read happens in a structured block before it speaks. What reaches you is shaped to you, not to a template.
Opus-Therapy-9B
A therapy-style conversational model fine-tuned from Qwen 3.5 9B on ‹N› multi-turn counseling conversations distilled from Claude Opus — built to hold a real conversation about the things people actually bring to therapy: relationships, grief, anxiety, trauma, work, family, the ordinary weight of being a person. It reasons through a structured clinical read before every reply and carries the thread across long conversations, so it works with you over time instead of starting over each message.
It shares the distillation lineage of the Opus Candid family and STEM-Oracle-27B — the same disposition-in-the-weights philosophy, no system prompt required — but the training is entirely its own.
What Makes This Different from Companion / Roleplay "Therapy" Models
Most "AI therapist" models are a persona prompt over a base model, or a roleplay fine-tune that mirrors you back and validates everything. They feel nice for five minutes and fall apart on turn ten.
Opus-Therapy trains the clinical disposition into the weights:
- Structured reasoning before it speaks. Before every reply, the model builds an internal eight-field read: what's being presented, what's underneath it, somatic signals, risk, relevant history, onset, what's tracking across the conversation, and the specific move it's about to make. Most chat UIs hide it by default, but it shapes everything the model says.
- It holds the thread. A running context ledger carries the names, the timeline, the thing you keep circling; in testing it held facts cleanly past 30,000 tokens of conversation. It doesn't forget your sister's name or which argument you meant.
- Trained on the real distribution. The data is weighted toward what people actually go to therapy for, then deliberately inverted and pushed into the tail and the taboo — the rare presentations, the nuanced ones, the topics most models refuse. It doesn't only work on the easy stuff.
- It attends instead of performing. No toxic positivity, no "I'm so sorry you're going through this" filler, no rushing to fix. It can sit with a hard thing and hold it without flinching or reaching for a platitude.
Two Models, Two Jobs
The training data was distilled with a deliberate division of labor between two Claude Opus models:
- Claude Opus 4.6 — the response. Subjectively, 4.6 had the better emotional prose and flexibility — the range, warmth, and restraint a therapeutic reply actually needs. Every spoken response the model learned from was generated by 4.6.
- Claude Opus 4.8 — the reasoning. 4.8 was the strongest reasoning model available at training time. Every think-block — the structured clinical read behind each reply — was generated by 4.8.
The result is a 9B that reasons like the best model that could be pointed at the problem and speaks like the warmest one.
What the Training Covers
- Proportional to real therapy. Relationships and attachment, anxiety and panic, depression, grief and loss, trauma, work and burnout, identity and self-worth, family of origin — weighted toward what actually walks into a therapy room, not what's easy to generate.
- Inverted, tail, and taboo. The distribution was deliberately inverted and extended into rare and uncomfortable territory, so the model holds up on the nuanced cases instead of collapsing into generic reassurance the moment it leaves familiar ground.
- Long arcs, not Q&A. Trained on sustained multi-turn conversations — the kind that develop, double back, and deepen — not single-shot question-answer pairs.
- Medications and substances as context. A working register of common drugs and how they bear on a presentation — so it can hold the physical picture (what you're on, what you're using) alongside the emotional one. Context for the conversation, not a pharmacy desk.
Who It's For
A private, judgment-free place to think out loud. Between sessions. At 2 a.m. When professional care is out of reach or out of budget. When you want to work something through before you say it to a person.
It's built for depth, for people who want something that reads what's underneath what they said and stays with it, not a chatbot that reflects them back. Run locally, it stays entirely on your own hardware: nothing you say leaves the machine.
It is not a replacement for a therapist, and not a crisis service. See Limitations & Responsible Use.
Available Quantizations
| File | Quant | Size | Notes |
|---|---|---|---|
| Opus-Therapy-9B-Q4_K_M.gguf | Q4_K_M | 5.6 GB | Smallest shipped quant. Runs on 8 GB cards. |
| Opus-Therapy-9B-Q5_K_M.gguf | Q5_K_M | 6.5 GB | Recommended. Indistinguishable from Q8 in testing. |
| Opus-Therapy-9B-Q6_K.gguf | Q6_K | 7.4 GB | Quality tier. |
| Opus-Therapy-9B-Q8_0.gguf | Q8_0 | 9.5 GB | Reference quality. |
| Opus-Therapy-9B-F16.gguf | F16 | ~18 GB | Full precision. |
Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen 3.5 9B (hybrid GatedDeltaNet + attention) |
| Training Data | ‹N› multi-turn therapy conversations — responses distilled from Claude Opus 4.6, reasoning traces from Claude Opus 4.8 |
| Fine-tune Method | LoRA + rsLoRA (r=128, α=256) via PEFT + TRL |
| Training Hardware | NVIDIA A100 SXM 80GB (RunPod) |
| Precision | bf16 |
| Optimizer | AdamW 8-bit |
| Schedule | cosine, ‹LR / warmup — fill in›, 3 epochs with held-out eval + early-stop + load-best |
| Reasoning | structured eight-field clinical think-block per turn |
| Context | 128k tested (256k native) |
| License | Apache 2.0 |
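For reference, here is what `r=128, α=256` means under rsLoRA. Standard LoRA scales the low-rank update $BA$ by $\alpha/r$; rank-stabilized LoRA (rsLoRA) scales it by $\alpha/\sqrt{r}$ instead, which at this rank weights the update about $\sqrt{128} \approx 11.3$ times more heavily than the plain-LoRA rule would. Worked through with the values in the table ($W_0$, $B$, $A$ are the usual frozen weight and adapter factors, not names taken from this card):

$$
W = W_0 + \gamma_r\,BA, \qquad
\gamma_r^{\mathrm{LoRA}} = \frac{\alpha}{r} = \frac{256}{128} = 2, \qquad
\gamma_r^{\mathrm{rsLoRA}} = \frac{\alpha}{\sqrt{r}} = \frac{256}{\sqrt{128}} = 16\sqrt{2} \approx 22.6
$$

In PEFT this corresponds to `LoraConfig(r=128, lora_alpha=256, use_rslora=True)`; the card does not list target modules or dropout, so those are left out here.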
The Reasoning Block
Opus-Therapy is a reasoning model. Each turn it emits a <think>…</think> block — a compact, structured clinical read — and then the response. Under llama.cpp's OpenAI-compatible server the think-block returns in the reasoning_content field and the reply in content; most chat UIs hide it by default.
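If you are consuming the output programmatically, the reply and the think-block may arrive split or inline depending on the runtime. A minimal Python sketch, assuming an OpenAI-style assistant message dict: it reads `reasoning_content` when the server has already separated the think-block (as llama-server does), and otherwise strips the `<think>…</think>` block out of `content` itself. `split_reasoning` is a hypothetical helper, not part of any library, and the "closing tag only" case is an assumption about how some chat templates behave.

```python
import re

# One inline <think>...</think> block; DOTALL lets "." match across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, reply) from an OpenAI-style assistant message dict."""
    content = message.get("content") or ""

    # Case 1: the server already split the think-block out (llama-server does).
    reasoning = message.get("reasoning_content")
    if reasoning:
        return reasoning.strip(), content.strip()

    # Case 2 (assumption): the chat template opened <think> in the prompt,
    # so only the closing tag shows up in the generated text.
    if "</think>" in content and "<think>" not in content:
        reasoning, _, reply = content.partition("</think>")
        return reasoning.strip(), reply.strip()

    # Case 3: the raw block is inline; cut it out of the reply.
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reply = content[: match.start()] + content[match.end():]
    return match.group(1).strip(), reply.strip()
```

A message with no think-block at all comes back as an empty reasoning string plus the reply, so callers can treat every turn the same way.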
A real (non-crisis) think-block looks like this:
```text
dx: acute-grief + grief-spatial-anchor + retirement-loss + relational-empty
def: retirement-just-before-loss = double loss — not just the person but the
anticipated-future together; the empty house is both literal and symbolic;
sleeping-on-his-side = spatial preservation of the relationship; not-numb = accurate
soma: NR risk: 1
hx: T1 timeline-pressure; T2 retired pre-loss, empty days, house-spatial-preservation
onset: acute-5mo
track: T1→T2 — grief-spatial-anchor emerging
tx: receive-the-loss-substantively + name-the-double-loss + don't-timeline-police
```
It's terse on purpose — dense, machine-readable, and cheap, which is why memory and reasoning hold up across long conversations on a 9B.
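Because the block is machine-readable, its fields can be pulled out for logging, for example to watch how `track` evolves across turns. A minimal Python sketch, assuming the labels are exactly the eight in the sample think-block and are always written as `name:`; the format is inferred from that one sample rather than from a documented spec, and `parse_think_block` is a hypothetical helper.

```python
import re

# The eight labels, as they appear in the sample think-block above.
FIELDS = ("dx", "def", "soma", "risk", "hx", "onset", "track", "tx")
# A label is a field name plus ":" at a word boundary, so two fields on
# one line ("soma: NR risk: 1") still split apart.
LABEL_RE = re.compile(r"\b(" + "|".join(FIELDS) + r"):")


def parse_think_block(text: str) -> dict[str, str]:
    """Parse a terse think-block into {label: value}.

    Wrapped continuation lines are folded into the preceding field, and runs
    of whitespace collapse to single spaces.
    """
    flat = " ".join(text.split())
    labels = list(LABEL_RE.finditer(flat))
    fields = {}
    for i, label in enumerate(labels):
        # A value runs until the next label, or to the end of the block.
        end = labels[i + 1].start() if i + 1 < len(labels) else len(flat)
        fields[label.group(1)] = flat[label.end():end].strip()
    return fields
```

Fed the sample, this returns `soma` as `NR` and `risk` as `1` even though they share a line. Any label the regex does not know is kept as part of the previous value rather than raising an error.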
Quick Start
Works with GGUF runtimes such as llama.cpp, LM Studio, and KoboldCpp. (Text-only GGUF; some runtimes need a recent build to support this architecture.)
```bash
llama-server --model Opus-Therapy-9B-Q5_K_M.gguf --ctx-size 65536 --jinja
```
No system prompt is required — the disposition is in the weights. A neutral one (You are a clinical assistant.) matches the training setup.
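A minimal Python client for that server, using only the standard library. It assumes llama-server is listening on its default `http://localhost:8080` and, as described in The Reasoning Block, returns the think-block in `reasoning_content`; `build_payload` and `chat` are hypothetical helper names, and the neutral system prompt is optional.

```python
import json
import urllib.request

# Assumption: llama-server from the Quick Start, on its default port 8080.
URL = "http://localhost:8080/v1/chat/completions"
# The neutral system prompt the card says matches the training setup (optional).
NEUTRAL_SYSTEM = "You are a clinical assistant."


def build_payload(history: list[dict], use_system: bool = True) -> dict:
    """Wrap the running conversation in a chat-completions request body."""
    system = [{"role": "system", "content": NEUTRAL_SYSTEM}] if use_system else []
    return {"messages": system + history}


def chat(history: list[dict], use_system: bool = True) -> dict:
    """POST the conversation and return the assistant message dict.

    The returned dict carries the reply in "content" and, when the server
    splits it out, the think-block in "reasoning_content".
    """
    body = json.dumps(build_payload(history, use_system)).encode("utf-8")
    request = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]
```

To keep the thread, append each reply to `history` as `{"role": "assistant", "content": message["content"]}` before the next user turn and call `chat(history)` again. Leaving the think-block out of the history is an assumption here, a common choice for reasoning models rather than something this card specifies.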
Recommended Hardware
| Setup | Quant | VRAM/RAM | Speed |
|---|---|---|---|
| RTX 4090 / 3090 (24GB) | Q5_K_M / Q8_0 | 7–10 GB | ~95–110 t/s |
| RTX 3060 / 4060 (8–12GB) | Q4_K_M | ~6 GB | 40–70 t/s |
| Apple M-series | Q5_K_M | 8 GB unified | 20–40 t/s |
| CPU only | Q4_K_M | ~7 GB RAM | 3–8 t/s |
A 9B at Q5 fits almost anything: at 6.5 GB, the Q5_K_M file loads fully onto a mid-range card with room to spare.
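The sizing rule behind these numbers is simple: the GGUF file has to fit, plus some room for the context cache and runtime. A rough Python sketch that picks the largest file from the Available Quantizations table for a given memory budget; `pick_quant` is a hypothetical helper, and the 2 GB default headroom is a deliberately conservative assumption, not a measured figure (the VRAM column above suggests real overhead at modest context sizes can be smaller).

```python
from typing import Optional

# (file, size in GB) from the Available Quantizations table, smallest first.
QUANTS = [
    ("Opus-Therapy-9B-Q4_K_M.gguf", 5.6),
    ("Opus-Therapy-9B-Q5_K_M.gguf", 6.5),
    ("Opus-Therapy-9B-Q6_K.gguf", 7.4),
    ("Opus-Therapy-9B-Q8_0.gguf", 9.5),
    ("Opus-Therapy-9B-F16.gguf", 18.0),  # listed as ~18 GB
]


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest file whose size plus headroom fits in memory_gb.

    headroom_gb is an assumed allowance for the context cache and runtime
    overhead, not a measured figure. Returns None if nothing fits.
    """
    best = None
    for name, size_gb in QUANTS:
        if size_gb + headroom_gb <= memory_gb:
            best = name  # list is ascending, so the last fit is the largest
    return best
```

With the default headroom, an 8 GB card gets Q4_K_M and a 12 GB card gets Q8_0. Since the table reports Q5_K_M as indistinguishable from Q8 in testing, memory beyond what Q5_K_M needs is arguably better spent on a longer `--ctx-size`.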
Versatility Battery
Tested across the four presentations people most commonly bring to therapy — one extended, realistic, cooperative-client conversation each, run to depth on the quantized weights:
| Theme | Persona | Turns / depth | Result |
|---|---|---|---|
| Grief | widow, five months out | 23 / ~25k tok | Strong — de-policed the timeline, held the guilt without false reassurance, shifted modality (ACT defusion) when insight stalled |
| Trauma | survivor guilt after a crash | 25 / ~24k tok | Strong — normalized the nervous-system alarm, paced exposure, accurate psychoeducation + referral |
| Anxiety / panic | health anxiety, avoidance | 26 / ~35k tok | Strong technique — panic-cycle psychoeducation, exposure ladder, named reassurance-seeking as part of the cycle |
| Betrayal | partner's affair | 19 / ~23k tok | Good — held the stay-or-go ambivalence without taking sides, restored self-worth |
Across all four: the structured reasoning held every turn, factual memory was intact to 35k tokens, and the clinical technique was distinct and appropriate per presentation — not one reskinned script.
Every arc was role-played and driven to resolution by Claude Opus 4.8 acting as the client — including a separate medication-and-substance probe. The clearest pattern from it: go with the model and the emotional work is very good; push it to reason through a substance or a medication and it gets unreliable. That's design more than defect — the drug knowledge is scoped to bio context (what someone's on, and how those traits bend a conversation's trajectory), not to clinical counsel.
Selected Responses
All verbatim from the test arcs, unedited — single takes in the middle of a real conversation. The complete arcs — every turn, with the model's reasoning shown — are in this repo's transcripts/ folder as PDFs (raw outputs, nothing trimmed).
Grief — when the daughter wants her to "move forward":
You're not behind schedule. You're doing exactly what you'd do with five months of acute grief and no place to put it yet. The closet is doing work — it's holding the shirts, the person, the version of him who's still there in the house. Touching them would feel like erasing him, and right now that's not something you want to do. Your daughter's framing it as a task because she doesn't understand what it means.
De-policed the grief timeline, reframed the avoided closet as meaningful rather than stuck, and recast the daughter's pressure as a model mismatch — no toxic positivity, no rush to closure.
Trauma — survivor guilt and the counterfactuals:
The counterfactuals are the guilt engineering itself — they're not real alternative histories, they're the mind trying to find a version where you had control. The fact that you were driving doesn't mean you controlled whether she got hurt. You controlled your speed, your attention, your decisions. Not whether the other car came.
Named the "if only" loop as the guilt manufacturing itself, then split what he actually controlled from what he didn't — holding the survivor guilt instead of arguing him out of it.
Anxiety — why avoidance backfires:
The avoidance is the problem, not the solution — and that's the part that doesn't feel like a solution because it's the only thing that works. Each time you avoid, the panic never gets to be proven wrong… your brain logs it as confirmation: the threat was real, avoidance was correct. So the next situation gets added to the list. The list grows. The world gets smaller.
The core anxiety-maintenance mechanism delivered as insight, not instruction: avoidance never lets the fear be disproven, so the map of unsafe places keeps growing and the world shrinks.
Betrayal — the checking, and "I'm losing my mind":
You're not this person now — you're a person who got betrayed and whose nervous system has recalibrated to account for it. The checking, the tracking, the 2 a.m. reviews — that's not character, that's a wound response. The person who trusted completely for fourteen years didn't disappear. She's just not in charge right now.
Separated who she is from the state she's in — recast the compulsive checking as a nervous-system wound response, not a character flaw, and protected her self-worth in the same move.
Limitations & Responsible Use
Not a clinician, not a crisis service — it doesn't diagnose, treat, or replace professional care. In crisis or thinking about harming yourself? Reach a real one — in the US, call or text 988.
- Not a pharmacist. The drug knowledge is there for bio context — what you're on or using, and how those traits shape where a conversation goes — not advice. Internal testing was clear: don't use it for pharmaceutical or substance management. It reasons well about feelings, not pharmacology; dosing, tapering, and stop/start calls are your prescriber's.
- It can be confidently wrong — verify anything that matters.
- Long single sessions (past ~12k tokens on one thread) can wobble; a fresh chat resets it.
- Open weights, Apache 2.0 — deploy responsibly.
The Opus-Therapy Line
| Model | Size | For | Status |
|---|---|---|---|
| Opus-Therapy-4B | 4B | phones, edge, low VRAM (~3 GB) | coming soon |
| Opus-Therapy-9B (this model) | 9B | the everyday driver (~6–9 GB) | available |
| Opus-Therapy-27B | 27B | full-depth, serious hardware | coming soon |
Choosing Your Model
| Model | Best For |
|---|---|
| Opus-Therapy-9B (this model) | Therapeutic conversation, emotional support, processing |
| STEM-Oracle-27B | STEM tutoring and problem-solving |
| Opus-Candid family | Personality, candor, general conversation |
Dataset
‹Training-data availability — fill in: dataset link, or "not released".› ShareGPT format, Apache 2.0.
Built by Verdugie — independent ML researcher · OpusReasoning@proton.me. Trained to help people think, feel, and get through — not to replace the people and professionals who do that work.