🪨 caveman-qwen3.6 — GGUF

llama.cpp-compatible quantized GGUFs of njmason/caveman-qwen3.6 — a brevity-trained QLoRA fine-tune of unsloth/Qwen3.6-35B-A3B.

The adapter has been merged into the base model and converted to GGUF for direct inference via llama.cpp (CPU, Metal, Vulkan, CUDA).

For background on the fine-tune (training data, methodology, base-vs-trained comparison), see the adapter repo.


Why this exists

Standard LLMs are verbose. caveman-qwen3.6 has the brevity behavior baked into the weights — no system prompt required. In smoke-test comparisons against the base Qwen3.6-35B-A3B, output length dropped 75-90% with no observed correctness loss.

Prompt | Base (words) | caveman-qwen3.6 (words)
"How do I reverse a string in Python?" | 98+ | 8
"What is the capital of Japan?" | 53 | 1
"Write a function that returns true if a number is even." | 22 | 10
"How do I list files larger than 100MB on Linux?" | 114 | 9

Available quantizations

File | Size | BPW | Notes
caveman-qwen3.6-BF16.gguf | ~70 GB | 16.01 | Full precision. For benchmarking and further quantization.
caveman-qwen3.6-Q8_0.gguf | ~37 GB | 8.5 | Near-lossless. Recommended if VRAM/RAM allows.
caveman-qwen3.6-Q5_K_M.gguf | ~25 GB | 5.7 | High quality; minor degradation vs Q8_0.
caveman-qwen3.6-Q4_K_M.gguf | ~21 GB | 4.88 | Recommended default. Balanced size/quality; fits 24 GB consumer GPUs (RTX 4090, etc.).
caveman-qwen3.6-Q3_K_M.gguf | ~16 GB | ~3.8 | Smaller, more degradation. For tight VRAM.

Sizes are approximate. All quants were produced with llama.cpp's llama-quantize from the BF16 source.
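
If you need a size/quality point not listed above, the quants can be reproduced locally. A minimal sketch, assuming a recent llama.cpp checkout; the local HF-weights path is illustrative:

# Convert the merged HF weights to a BF16 GGUF, then quantize
# (or download the published BF16 GGUF and skip the first step):
python convert_hf_to_gguf.py ./caveman-qwen3.6-merged --outtype bf16 --outfile caveman-qwen3.6-BF16.gguf
./llama-quantize caveman-qwen3.6-BF16.gguf caveman-qwen3.6-Q4_K_M.gguf Q4_K_M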


Hardware fit

Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35B total parameters, ~3B active per token. Quantization compresses all 35B params, but inference compute scales with the 3B active set, making this model exceptionally fast for its parameter count.

Approximate VRAM/RAM requirements for inference (varies with context length):

  • Q3_K_M — ~18 GB (RTX 4080, M2 Pro 32GB)
  • Q4_K_M — ~23 GB (RTX 4090, M3 Max 36GB)
  • Q5_K_M — ~28 GB (RTX 5090, M3 Max 64GB)
  • Q8_0 — ~40 GB (A100 40GB, M3 Ultra)
  • BF16 — ~72 GB (A100 80GB, H100, M3 Ultra max)
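
These figures roughly track the quantized file sizes above plus KV-cache and runtime overhead. As a back-of-envelope check, file size is total parameters times bits-per-weight divided by 8:

# Rough file-size estimate from BPW: bytes ≈ total_params × BPW / 8
python3 -c "print(35e9 * 4.88 / 8 / 1e9)"   # ≈ 21.35, i.e. ~21 GB for Q4_K_M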

CPU-only inference works but is slow. Apple Silicon Metal and CUDA are well-supported.
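
If you build llama.cpp from source, enable the backend for your hardware. A minimal sketch for CUDA (Metal is enabled by default on Apple Silicon builds; flag names have changed across llama.cpp versions):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# llama-cli, llama-server, and llama-quantize end up in build/bin/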


Usage with llama.cpp

# Download (replace Q4_K_M with your chosen quant)
hf download njmason/caveman-qwen3.6-GGUF caveman-qwen3.6-Q4_K_M.gguf --local-dir ./

# Run with llama.cpp CLI
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.5 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --chat-template-kwargs '{"enable_thinking":false}'

enable_thinking=false is recommended for the terse-by-default behavior. With thinking enabled, the model still produces a reasoning trace before answering, which can be useful for harder problems.

For the OpenAI-compatible server:

./llama-server \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --alias "njmason/caveman-qwen3.6" \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --port 8080 \
  --chat-template-kwargs '{"enable_thinking":false}'
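
Once the server is running, requests go to the usual OpenAI-style endpoint. A minimal sketch (the prompt and sampling values are just examples):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "njmason/caveman-qwen3.6",
    "messages": [{"role": "user", "content": "How do I list files larger than 100MB on Linux?"}],
    "temperature": 0.7,
    "top_p": 0.8
  }'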

MoE expert offloading (limited VRAM)

To fit on smaller GPUs, offload expert FFN layers to CPU and keep dense layers on GPU:

./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  --n-gpu-layers 99 \
  ... # other args

This works because of the small MoE active set (only ~3B parameters are active per token): most expert weights can sit in CPU RAM, and only the routed experts are read on each generation step.
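
If you have some VRAM to spare, a middle ground is to offload only part of the expert tensors. A sketch; the layer cutoff is illustrative, not tuned:

# Keep experts for layers 0-15 on GPU, push experts for layers 16+ to CPU:
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  -ot "blk\.(1[6-9]|[2-9][0-9])\.ffn_.*_exps\.=CPU" \
  --n-gpu-layers 99 \
  ... # other args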


Sampling recommendations

Inherited from the base Qwen3.6-A3B chat config:

Non-thinking mode (enable_thinking=false, recommended for caveman behavior):

  • temperature = 0.7
  • top_p = 0.8
  • top_k = 20
  • min_p = 0.0
  • presence_penalty = 1.5

Thinking mode (enable_thinking=true):

  • temperature = 0.6
  • top_p = 0.95
  • top_k = 20
  • min_p = 0.0
  • presence_penalty = 1.5

For precise coding tasks, drop presence_penalty to 0.0 and temperature to 0.6.
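
As llama.cpp flags, that coding preset looks like the following (combine with the model, context, and GPU arguments shown earlier):

# Coding preset: lower temperature, no presence penalty
./llama-cli \
  --model caveman-qwen3.6-Q4_K_M.gguf \
  --temp 0.6 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  ... # other args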


Limitations & Caveats

  • The adapter is attention-only. Expert FFN weights were not adapted during training (Axolotl's ScatterMoE LoRA targeted attention projections only); brevity nonetheless emerged from attention-level adaptation.
  • Small training dataset — 1,500 synthetic pairs. May not generalize perfectly to all domains.
  • Extreme brevity may omit context — not suited for tutorials, education, compliance docs, creative writing, or analysis essays.
  • Not formally benchmarked. Smoke tested on 5 prompts only. No MMLU / HumanEval / etc. runs against the trained model. Production users should evaluate on their own task distribution.
  • Vision capability untested. The base model is multimodal (Qwen3.6 VL); fine-tuning was text-only, so the vision pathway was never exercised. The vision tower remains in the merged weights (and is included in the GGUF) but has not been validated after the fine-tune.

License

Apache-2.0 (matches base model).


Citation

@misc{caveman-qwen3.6-gguf,
  author = {Nick Mason},
  title = {caveman-qwen3.6 GGUF: Quantized brevity-trained variants of Qwen3.6-35B-A3B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/njmason/caveman-qwen3.6-GGUF}
}

Inspired by Mintzs/oogaboogalm, itself inspired by JuliusBrussee/caveman.
