Granite 4.0-H-Small — Cerebellum GGUF (14.2 GB)
Ablation-informed mixed-precision quantization of ibm-granite/granite-4.0-h-small. File size: 14.2 GB. Measured WikiText-2 perplexity: 6.4580 (+1.90% vs. the Q3_K_M baseline of 6.3376).
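For reference, the headline delta is simply the relative change against that baseline:

$$
\Delta_{\text{PPL}} = \frac{6.4580 - 6.3376}{6.3376} \approx +1.90\%
$$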
This is the first Cerebellum build for a hybrid Mamba-2 + Transformer MoE architecture. The ablation revealed that routed expert weights are sensitive in this model while shared expert weights tolerate aggressive demotion — the opposite of what we expected from dense transformer MoE patterns.
Benchmarks
| Benchmark | Cerebellum (14.2 GB) |
|---|---|
| WikiText-2 PPL | 6.4580 |
| HellaSwag | 87.1% |
| ARC-Challenge | 90.7% |
| MMLU-Redux | 68.6% |
All benchmarks measured directly on this file.
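The perplexity number should be reproducible with llama.cpp's llama-perplexity tool. A minimal sketch, assuming a local llama.cpp build and the raw WikiText-2 test split (file paths are illustrative):

```bash
# Perplexity over WikiText-2 (raw test split); -ngl 99 offloads all layers to the GPU.
llama-perplexity \
  --model Granite-4.0-H-Small-Cerebellum.gguf \
  --file wikitext-2-raw/wiki.test.raw \
  -ngl 99
```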
What Changed
Three shared expert tensor groups (120 tensors) demoted from Q3_K to Q2_K:
| Group | Layers | PPL Delta | Size Saved |
|---|---|---|---|
| shared_mlp.input_linear (gate) | 40 | — | — |
| shared_mlp.input_linear (up) | 40 | — | — |
| shared_mlp.output_linear (down) | 40 | — | — |
| Combined | 120 | +1.90% | 0.1 GB |
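A rough sketch of how such a demotion can be expressed with llama-quantize, assuming a recent llama.cpp build that supports per-tensor overrides via --tensor-type. The imatrix file name and the shared-expert tensor patterns (ffn_*_shexp is the usual GGUF naming) are assumptions, not the exact recipe used for this file:

```bash
# Requantize from the F16 source using an importance matrix.
# Only the shared-expert projections are forced to Q2_K; everything else
# follows the standard Q3_K_M recipe.
llama-quantize \
  --imatrix granite-4.0-h-small.imatrix \
  --tensor-type ffn_gate_shexp=q2_k \
  --tensor-type ffn_up_shexp=q2_k \
  --tensor-type ffn_down_shexp=q2_k \
  Granite-4.0-H-Small-F16.gguf \
  Granite-4.0-H-Small-Cerebellum.gguf \
  Q3_K_M
```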
What We Tested (and Kept at Q3_K)
| Group | PPL Delta | Verdict |
|---|---|---|
| ffn_gate_exps (routed experts) | +4.95% | Keep at Q3_K |
| ssm_out (Mamba output) | +5.01% | Keep at Q3_K |
| ffn_up_exps (routed experts) | +6.04% | Keep at Q3_K |
| ffn_down_exps (routed experts) | +10.63% | Keep at Q3_K |
| ssm_in (Mamba input) | +12.85% | Keep at Q3_K |
Key Finding: Routed Experts Are Sensitive
In dense MoE models like Qwen 3.6 35B, expert gate/up/down weights tolerate Q2_K easily (+1-2%). In Granite 4.0-H-Small, the opposite is true:
- Shared experts (always active): Tolerant. Q2_K adds only +1.90% PPL.
- Routed experts (72 per layer, 10 active): Sensitive. Q2_K adds +5-13% PPL.
This is likely because Granite's expert FFN intermediate size is only 768 (vs 1536+ in larger models). Smaller weight matrices are more sensitive to quantization noise.
Architecture
| Parameter | Value |
|---|---|
| Total params | 32B |
| Active params | 9B per token |
| Layers | 40 |
| Experts | 72 per layer (10 active) |
| Full attention layers | 4 (positions 5, 15, 25, 35) |
| Mamba-2 layers | 36 |
| Context | 128K |
How to Run
```bash
llama-server --model Granite-4.0-H-Small-Cerebellum.gguf -ngl 99 --ctx-size 4096
```
Fits on a 24GB GPU with room for context.
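Once the server is running, it exposes an OpenAI-compatible API (port 8080 by default); a quick smoke test:

```bash
# Minimal chat completion request against the local llama-server instance.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain mixed-precision quantization in one sentence."}],
        "max_tokens": 128
      }'
```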
Method: Cerebellum
Cerebellum is sensitivity-guided mixed-precision quantization. We measure the PPL impact of demoting each tensor group individually, then only demote groups that stay under a threshold. Sacred tensors (routers, norms, embeddings) are never touched.
Steps:
- Start from high-quality imatrix Q3_K_M base
- Group tensors by function (120 shared experts, 120 routed experts, 72 Mamba projections, etc.)
- Demote each group to Q2_K individually and measure the PPL delta (see the sketch after this list)
- Only ship groups that pass the threshold (+3% max)
- Verify combined build doesn't compound beyond acceptable range
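A condensed sketch of the per-group ablation loop (steps 2-4), again assuming a llama-quantize build with --tensor-type overrides; the group patterns, file names, and threshold handling are illustrative:

```bash
#!/usr/bin/env bash
# For each candidate tensor group, build a one-off quant with only that group
# demoted to Q2_K, measure WikiText-2 perplexity, and record the delta.
BASE_PPL=6.3376   # measured on the Q3_K_M baseline
GROUPS=(
  "ffn_gate_shexp" "ffn_up_shexp" "ffn_down_shexp"
  "ffn_gate_exps" "ffn_up_exps" "ffn_down_exps"
  "ssm_in" "ssm_out"
)

for g in "${GROUPS[@]}"; do
  llama-quantize --imatrix granite-4.0-h-small.imatrix \
    --tensor-type "${g}=q2_k" \
    Granite-4.0-H-Small-F16.gguf "ablate-${g}.gguf" Q3_K_M

  llama-perplexity -m "ablate-${g}.gguf" \
    -f wikitext-2-raw/wiki.test.raw -ngl 99 2>&1 | tee "ppl-${g}.log"

  # Read the final PPL out of the log and compute (ppl - BASE_PPL) / BASE_PPL;
  # groups above the +3% threshold stay at Q3_K in the shipped build.
done
```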
Quantized by deucebucket using Cerebellum methodology.