🪨 caveman-qwen3.6 — GGUF
llama.cpp-compatible quantized GGUFs of njmason/caveman-qwen3.6 — a brevity-trained QLoRA fine-tune of unsloth/Qwen3.6-35B-A3B.
The adapter has been merged into the base model and converted to GGUF for direct inference via llama.cpp (CPU, Metal, Vulkan, CUDA).
For background on the fine-tune (training data, methodology, base-vs-trained comparison), see the adapter repo.
Why this exists
Standard LLMs are verbose. caveman-qwen3.6 has the brevity behavior baked into the weights — no system prompt required. In smoke-test comparisons against the base Qwen3.6-35B-A3B, output length dropped 75-90% with no observed correctness loss.
| Prompt | Base (words) | caveman-qwen3.6 (words) |
|---|---|---|
| "How do I reverse a string in Python?" | 98+ | 8 |
| "What is the capital of Japan?" | 53 | 1 |
| "Write a function that returns true if a number is even." | 22 | 10 |
| "How do I list files larger than 100MB on Linux?" | 114 | 9 |
Available quantizations
| File | Size | BPW | Notes |
|---|---|---|---|
| caveman-qwen3.6-BF16.gguf | ~70 GB | 16.01 | Full precision. For benchmarking + further quantization. |
| caveman-qwen3.6-Q8_0.gguf | ~37 GB | 8.5 | Near-lossless. Recommended if VRAM/RAM allows. |
| caveman-qwen3.6-Q5_K_M.gguf | ~25 GB | 5.7 | High quality; minor degradation vs Q8_0. |
| caveman-qwen3.6-Q4_K_M.gguf | ~21 GB | 4.88 | Recommended default. Balanced size/quality; fits 24 GB consumer GPUs (RTX 4090, etc.). |
| caveman-qwen3.6-Q3_K_M.gguf | ~16 GB | ~3.8 | Smaller, more degradation. For tight VRAM. |
Sizes are approximate. All quants were produced with llama.cpp's llama-quantize from the BF16 source.
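As a rough sketch (file names assumed to match this repo), the smaller quants can be reproduced from the BF16 file with the llama-quantize tool that ships with llama.cpp builds:

```bash
# Re-quantize the BF16 source down to Q4_K_M (swap in Q5_K_M, Q8_0, ... as needed)
./llama-quantize caveman-qwen3.6-BF16.gguf caveman-qwen3.6-Q4_K_M.gguf Q4_K_M
```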
Hardware fit
Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35B total parameters, ~3B active per token. Quantization compresses all 35B params, but inference compute scales with the 3B active set, making this model exceptionally fast for its parameter count.
Approximate VRAM/RAM requirements for inference (varies with context length):
- Q3_K_M — ~18 GB (RTX 4080, M2 Pro 32GB)
- Q4_K_M — ~23 GB (RTX 4090, M3 Max 36GB)
- Q5_K_M — ~28 GB (RTX 5090, M3 Max 64GB)
- Q8_0 — ~40 GB (A100 40GB, M3 Ultra)
- BF16 — ~72 GB (A100 80GB, H100, M3 Ultra max)
CPU-only inference works but is slow. Apple Silicon Metal and CUDA are well-supported.
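For a rough sense of where these numbers come from: the weights alone scale as total parameters times bits per weight, and everything else (KV cache, activations, runtime buffers) comes on top. A back-of-the-envelope sketch using the BPW figures from the table above:

```bash
# Weights-only size estimate: total_params * BPW / 8 bytes.
# KV cache and runtime overhead add a few GB on top of each figure.
awk 'BEGIN { for (i = 1; i + 1 < ARGC; i += 2)
               printf "%-7s ~%.1f GB\n", ARGV[i], 35e9 * ARGV[i+1] / 8 / 1e9 }' \
    Q3_K_M 3.8  Q4_K_M 4.88  Q5_K_M 5.7  Q8_0 8.5  BF16 16.01
```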
Usage with llama.cpp
```bash
# Download (replace Q4_K_M with your chosen quant)
hf download njmason/caveman-qwen3.6-GGUF caveman-qwen3.6-Q4_K_M.gguf --local-dir ./

# Run with the llama.cpp CLI (non-thinking sampling settings; see Sampling recommendations below)
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.5 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --chat-template-kwargs '{"enable_thinking":false}'
```
enable_thinking=false is recommended for the terse-by-default behavior. With thinking enabled, the model still reasons internally — useful for harder problems.
For the OpenAI-compatible server:
```bash
./llama-server \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --alias "njmason/caveman-qwen3.6" \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --port 8080 \
  --chat-template-kwargs '{"enable_thinking":false}'
```
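Once the server is up, any OpenAI-compatible client can talk to it. A minimal curl sketch (prompt and sampling values are just examples; llama-server exposes the standard /v1/chat/completions route and requires no API key by default):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "njmason/caveman-qwen3.6",
        "messages": [{"role": "user", "content": "How do I reverse a string in Python?"}],
        "temperature": 0.7,
        "top_p": 0.8
      }'
```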
MoE expert offloading (limited VRAM)
To fit on smaller GPUs, offload expert FFN layers to CPU and keep dense layers on GPU:
```bash
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  --n-gpu-layers 99 \
  ... # other args
```
This exploits the small MoE active set (~3B of the 35B params are active per token): the expert tensors stay in CPU RAM and their matmuls run on the CPU, while attention and dense layers stay on the GPU, so per-token throughput remains usable.
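If you have some VRAM to spare, you can offload only part of the expert stack instead of all of it. The block range below is purely illustrative (adjust to your layer count and memory budget); the tensor names follow the usual GGUF convention blk.&lt;n&gt;.ffn_{gate,up,down}_exps:

```bash
# Keep the experts of the early blocks on the GPU, push the rest to CPU RAM
./llama-server \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  -ot "blk\.(2[0-9]|3[0-9]|4[0-9])\.ffn_.*_exps\.=CPU" \
  --n-gpu-layers 99 \
  --port 8080
```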
Sampling recommendations
Inherited from the base Qwen3.6-A3B chat config:
Non-thinking mode (enable_thinking=false, recommended for caveman behavior):
temperature = 0.7, top_p = 0.8, top_k = 20, min_p = 0.0, presence_penalty = 1.5
Thinking mode (enable_thinking=true):
temperature = 0.6, top_p = 0.95, top_k = 20, min_p = 0.0, presence_penalty = 1.5
For precise coding tasks, drop presence_penalty to 0.0 and temperature to 0.6.
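Mapped onto llama.cpp flags, the coding-oriented variant would look something like this (top_p carried over from the non-thinking defaults; combine with the context-size and GPU flags from the examples above):

```bash
# Precise coding tasks: lower temperature, no presence penalty
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --temp 0.6 --top-p 0.8 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 \
  --chat-template-kwargs '{"enable_thinking":false}'
```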
Limitations & Caveats
- MoE adapter is attention-only. The expert FFN weights were not adapted during training (Axolotl's ScatterMoE LoRA on attention only). Brevity emerged anyway from attention-level adaptation.
- Small training dataset — 1,500 synthetic pairs. May not generalize perfectly to all domains.
- Extreme brevity may omit context — not suited for tutorials, education, compliance docs, creative writing, or analysis essays.
- Not formally benchmarked. Smoke tested on 5 prompts only. No MMLU / HumanEval / etc. runs against the trained model. Production users should evaluate on their own task distribution.
- Vision capability untested. The base model is multimodal (Qwen3.6 VL); fine-tuning was text-only and the vision pathway was not exercised post-training. The vision tower remains in the merged weights (and is included in the GGUF) but has not been validated post-fine-tune.
License
Apache-2.0 (matches base model).
Citation
```bibtex
@misc{caveman-qwen3.6-gguf,
  author    = {Nick Mason},
  title     = {caveman-qwen3.6 GGUF: Quantized brevity-trained variants of Qwen3.6-35B-A3B},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/njmason/caveman-qwen3.6-GGUF}
}
```
Inspired by Mintzs/oogaboogalm, itself inspired by JuliusBrussee/caveman.