🪨 caveman-qwen3.6 — GGUF

llama.cpp-compatible quantized GGUFs of njmason/caveman-qwen3.6 — a brevity-trained QLoRA fine-tune of unsloth/Qwen3.6-35B-A3B.

The adapter has been merged into the base model and converted to GGUF for direct inference via llama.cpp (CPU, Metal, Vulkan, CUDA).

For background on the fine-tune (training data, methodology, base-vs-trained comparison), see the adapter repo.


Why this exists

Standard LLMs are verbose. caveman-qwen3.6 has the brevity behavior baked into the weights — no system prompt required. In smoke-test comparisons against the base Qwen3.6-35B-A3B, output length dropped 75-90% with no observed correctness loss.

Prompt | Base (words) | caveman-qwen3.6 (words)
"How do I reverse a string in Python?" | 98+ | 8
"What is the capital of Japan?" | 53 | 1
"Write a function that returns true if a number is even." | 22 | 10
"How do I list files larger than 100MB on Linux?" | 114 | 9

Available quantizations

File | Size | BPW | Notes
caveman-qwen3.6-BF16.gguf | ~70 GB | 16.01 | Full precision. For benchmarking and further quantization.
caveman-qwen3.6-Q8_0.gguf | ~37 GB | 8.5 | Near-lossless. Recommended if VRAM/RAM allows.
caveman-qwen3.6-Q5_K_M.gguf | ~25 GB | 5.7 | High quality; minor degradation vs Q8_0.
caveman-qwen3.6-Q4_K_M.gguf | ~21 GB | 4.88 | Recommended default. Balanced size/quality; fits 24 GB consumer GPUs (RTX 4090, etc.).
caveman-qwen3.6-Q3_K_M.gguf | ~16 GB | ~3.8 | Smaller, more degradation. For tight VRAM.

Sizes are approximate. All quants were produced with llama.cpp's llama-quantize from the BF16 source.
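
If you need a size/quality point not listed above, the quants can be reproduced locally. A minimal sketch, assuming a recent llama.cpp checkout; the local HF-weights path is illustrative:

# Convert the merged HF weights to a BF16 GGUF, then quantize
# (or download the published BF16 GGUF and skip the first step):
python convert_hf_to_gguf.py ./caveman-qwen3.6-merged --outtype bf16 --outfile caveman-qwen3.6-BF16.gguf
./llama-quantize caveman-qwen3.6-BF16.gguf caveman-qwen3.6-Q4_K_M.gguf Q4_K_M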


Hardware fit

Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35B total parameters, ~3B active per token. Quantization compresses all 35B params, but inference compute scales with the 3B active set, making this model exceptionally fast for its parameter count.

Approximate VRAM/RAM requirements for inference (varies with context length):

  • Q3_K_M — ~18 GB (RTX 4080, M2 Pro 32GB)
  • Q4_K_M — ~23 GB (RTX 4090, M3 Max 36GB)
  • Q5_K_M — ~28 GB (RTX 5090, M3 Max 64GB)
  • Q8_0 — ~40 GB (A100 40GB, M3 Ultra)
  • BF16 — ~72 GB (A100 80GB, H100, M3 Ultra max)
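
These figures roughly track the quantized file sizes above plus KV-cache and runtime overhead. As a back-of-envelope check, file size is total parameters times bits-per-weight divided by 8:

# Rough file-size estimate from BPW: bytes ≈ total_params × BPW / 8
python3 -c "print(35e9 * 4.88 / 8 / 1e9)"   # ≈ 21.35, i.e. ~21 GB for Q4_K_M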

CPU-only inference works but is slow. Apple Silicon Metal and CUDA are well-supported.
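
If you build llama.cpp from source, enable the backend for your hardware. A minimal sketch for CUDA (Metal is enabled by default on Apple Silicon builds; flag names have changed across llama.cpp versions):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# llama-cli, llama-server, and llama-quantize end up in build/bin/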


Usage with llama.cpp

# Download (replace Q4_K_M with your chosen quant)
hf download njmason/caveman-qwen3.6-GGUF caveman-qwen3.6-Q4_K_M.gguf --local-dir ./

# Run with llama.cpp CLI
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.5 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --chat-template-kwargs '{"enable_thinking":false}'

enable_thinking=false is recommended for the terse-by-default behavior. With thinking enabled, the model still produces a reasoning trace before answering, which can be useful for harder problems.

For the OpenAI-compatible server:

./llama-server \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --alias "njmason/caveman-qwen3.6" \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --port 8080 \
  --chat-template-kwargs '{"enable_thinking":false}'
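
Once the server is running, requests go to the usual OpenAI-style endpoint. A minimal sketch (the prompt and sampling values are just examples):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "njmason/caveman-qwen3.6",
    "messages": [{"role": "user", "content": "How do I list files larger than 100MB on Linux?"}],
    "temperature": 0.7,
    "top_p": 0.8
  }'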

MoE expert offloading (limited VRAM)

To fit on smaller GPUs, offload expert FFN layers to CPU and keep dense layers on GPU:

./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  --n-gpu-layers 99 \
  ... # other args

This works because of the small MoE active set (only ~3B parameters are active per token): most expert weights can sit in CPU RAM, and only the routed experts are read on each generation step.
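
If you have some VRAM to spare, a middle ground is to offload only part of the expert tensors. A sketch; the layer cutoff is illustrative, not tuned:

# Keep experts for layers 0-15 on GPU, push experts for layers 16+ to CPU:
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  -ot "blk\.(1[6-9]|[2-9][0-9])\.ffn_.*_exps\.=CPU" \
  --n-gpu-layers 99 \
  ... # other args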


Sampling recommendations

Inherited from the base Qwen3.6-A3B chat config:

Non-thinking mode (enable_thinking=false, recommended for caveman behavior):

  • temperature = 0.7
  • top_p = 0.8
  • top_k = 20
  • min_p = 0.0
  • presence_penalty = 1.5

Thinking mode (enable_thinking=true):

  • temperature = 0.6
  • top_p = 0.95
  • top_k = 20
  • min_p = 0.0
  • presence_penalty = 1.5

For precise coding tasks, drop presence_penalty to 0.0 and temperature to 0.6.
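
As llama.cpp flags, that coding preset looks like the following (combine with the model, context, and GPU arguments shown earlier):

# Coding preset: lower temperature, no presence penalty
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --temp 0.6 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  ... # other args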


Limitations & Caveats

  • The adapter is attention-only. Expert FFN weights were not adapted during training (Axolotl's ScatterMoE LoRA targeted attention projections only); brevity nonetheless emerged from attention-level adaptation.
  • Small training dataset — 1,500 synthetic pairs. May not generalize perfectly to all domains.
  • Extreme brevity may omit context — not suited for tutorials, education, compliance docs, creative writing, or analysis essays.
  • Not formally benchmarked. Smoke tested on 5 prompts only. No MMLU / HumanEval / etc. runs against the trained model. Production users should evaluate on their own task distribution.
  • Vision capability untested. The base model is multimodal (Qwen3.6 VL); fine-tuning was text-only, so the vision pathway was never exercised. The vision tower remains in the merged weights (and is included in the GGUF) but has not been validated after the fine-tune.

License

Apache-2.0 (matches base model).


Citation

@misc{caveman-qwen3.6-gguf,
  author = {Nick Mason},
  title = {caveman-qwen3.6 GGUF: Quantized brevity-trained variants of Qwen3.6-35B-A3B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/njmason/caveman-qwen3.6-GGUF}
}

Inspired by Mintzs/oogaboogalm, itself inspired by JuliusBrussee/caveman.
