Granite 4.0-H-Small — Cerebellum GGUF (14.2 GB)

Ablation-informed mixed-precision quantization of ibm-granite/granite-4.0-h-small. File size: 14.2 GB. Measured WikiText-2 perplexity: 6.4580 (+1.90% vs the Q3_K_M baseline of 6.3376).
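
The headline delta is just the relative change between those two measured perplexities; a quick sanity check:

```python
# Relative PPL change of this build vs. the Q3_K_M baseline (numbers from above).
baseline_ppl = 6.3376    # Q3_K_M baseline
cerebellum_ppl = 6.4580  # this 14.2 GB file

delta_pct = (cerebellum_ppl - baseline_ppl) / baseline_ppl * 100
print(f"+{delta_pct:.2f}%")  # -> +1.90%
```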

This is the first Cerebellum build for a hybrid Mamba-2 + Transformer MoE architecture. The ablation revealed that routed expert weights are sensitive in this model while shared expert weights tolerate aggressive demotion, the opposite of what we expected based on conventional transformer MoE models.

Benchmarks

| Benchmark | Cerebellum (14.2 GB) |
|---|---|
| WikiText-2 PPL | 6.4580 |
| HellaSwag | 87.1% |
| ARC-Challenge | 90.7% |
| MMLU-Redux | 68.6% |

All benchmarks measured directly on this file.

What Changed

Three shared expert tensor groups (120 tensors) demoted from Q3_K to Q2_K:

| Group | Layers | PPL Delta | Size Saved |
|---|---|---|---|
| shared_mlp.input_linear (gate) | 40 | | |
| shared_mlp.input_linear (up) | 40 | | |
| shared_mlp.output_linear (down) | 40 | | |
| Combined | 120 | +1.90% | 0.1 GB |
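
For readers who want to reproduce the grouping, a minimal sketch using the `gguf` Python package is below. The `blk.N.ffn_*_shexp` / `ffn_*_exps` / `ssm_*` name patterns are assumptions based on llama.cpp's usual naming for this architecture; verify them against the actual tensor list in the file.

```python
# Sketch: bucket GGUF tensor names into the groups used in the ablation.
# Assumes the `gguf` package (pip install gguf) and llama.cpp-style names
# such as "blk.12.ffn_up_shexp.weight" -- verify against your file.
import re
from collections import defaultdict

from gguf import GGUFReader

# Assumed group -> name-pattern map, mirroring the groups discussed in this card.
GROUPS = {
    "shared_gate": r"\.ffn_gate_shexp\.",
    "shared_up":   r"\.ffn_up_shexp\.",
    "shared_down": r"\.ffn_down_shexp\.",
    "routed_gate": r"\.ffn_gate_exps\.",
    "routed_up":   r"\.ffn_up_exps\.",
    "routed_down": r"\.ffn_down_exps\.",
    "ssm_in":      r"\.ssm_in\.",
    "ssm_out":     r"\.ssm_out\.",
}

reader = GGUFReader("Granite-4.0-H-Small-Cerebellum.gguf")
buckets = defaultdict(list)
for tensor in reader.tensors:
    for group, pattern in GROUPS.items():
        if re.search(pattern, tensor.name):
            buckets[group].append(tensor.name)
            break

for group, names in buckets.items():
    print(f"{group}: {len(names)} tensors")  # e.g. shared_up: 40 tensors
```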

What We Tested (and Kept at Q3_K)

| Group | PPL Delta | Verdict |
|---|---|---|
| ffn_gate_exps (routed experts) | +4.95% | Keep at Q3_K |
| ssm_out (Mamba output) | +5.01% | Keep at Q3_K |
| ffn_up_exps (routed experts) | +6.04% | Keep at Q3_K |
| ffn_down_exps (routed experts) | +10.63% | Keep at Q3_K |
| ssm_in (Mamba input) | +12.85% | Keep at Q3_K |
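
The keep/demote verdicts follow mechanically from the +3% threshold described under Method. A small illustration using the deltas measured above (the shared-expert figure is the combined +1.90% number, since per-group shared deltas aren't listed):

```python
# Apply the +3% ship threshold (see Method) to the per-group deltas above.
PPL_THRESHOLD_PCT = 3.0

measured_deltas = {
    "shared_mlp (gate/up/down, combined)": 1.90,
    "ffn_gate_exps": 4.95,
    "ssm_out": 5.01,
    "ffn_up_exps": 6.04,
    "ffn_down_exps": 10.63,
    "ssm_in": 12.85,
}

for group, delta in measured_deltas.items():
    verdict = "demote to Q2_K" if delta <= PPL_THRESHOLD_PCT else "keep at Q3_K"
    print(f"{group:38s} +{delta:5.2f}%  {verdict}")
```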

Key Finding: Routed Experts Are Sensitive

In conventional transformer MoE models such as Qwen 3.6 35B, routed expert gate/up/down weights tolerate Q2_K easily (+1-2% PPL). In Granite 4.0-H-Small, the opposite is true:

  • Shared experts (always active): Tolerant. Q2_K adds only +1.90% PPL.
  • Routed experts (72 per layer, 10 active): Sensitive. Q2_K adds +5-13% PPL.

This is likely because Granite's expert FFN intermediate size is only 768 (vs 1536+ in larger models). Smaller weight matrices are more sensitive to quantization noise.

Architecture

| Parameter | Value |
|---|---|
| Total params | 32B |
| Active params | 9B per token |
| Layers | 40 |
| Experts | 72 per layer (10 active) |
| Full attention layers | 4 (positions 5, 15, 25, 35) |
| Mamba-2 layers | 36 |
| Context | 128K |

How to Run

```
llama-server --model Granite-4.0-H-Small-Cerebellum.gguf -ngl 99 --ctx-size 4096
```

Fits on a 24GB GPU with room for context.
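
If you prefer loading it from Python, here is a minimal sketch using the llama-cpp-python bindings, mirroring the server flags above (adjust `n_gpu_layers` and `n_ctx` to your hardware):

```python
# Minimal llama-cpp-python sketch, equivalent to the llama-server flags above.
# Assumes `pip install llama-cpp-python` built with GPU support.
from llama_cpp import Llama

llm = Llama(
    model_path="Granite-4.0-H-Small-Cerebellum.gguf",
    n_gpu_layers=99,  # offload all layers, same as -ngl 99
    n_ctx=4096,       # same as --ctx-size 4096
)

out = llm("Explain Mamba-2 state-space layers in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```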

Method: Cerebellum

Cerebellum is sensitivity-guided mixed-precision quantization. We measure the PPL impact of demoting each tensor group individually, then only demote groups that stay under a threshold. Sacred tensors (routers, norms, embeddings) are never touched.

Steps:

  1. Start from high-quality imatrix Q3_K_M base
  2. Group tensors by function (120 shared experts, 120 routed experts, 72 Mamba projections, etc.)
  3. Demote each group to Q2_K individually and measure PPL delta
  4. Only ship groups that pass the threshold (+3% max)
  5. Verify combined build doesn't compound beyond acceptable range
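
A schematic of that loop is sketched below. `measure_ppl` and `demote_group` are hypothetical placeholders standing in for the actual perplexity runs and per-group requantization; this shows the shape of the procedure, not the released tooling.

```python
# Schematic of the Cerebellum sensitivity loop (steps 2-5 above).
PPL_THRESHOLD_PCT = 3.0  # maximum acceptable PPL delta for a demoted group

def measure_ppl(gguf_path: str) -> float:
    """Placeholder: run a WikiText-2 perplexity eval on the given GGUF."""
    raise NotImplementedError

def demote_group(base_gguf: str, groups: list[str], qtype: str) -> str:
    """Placeholder: requantize the named tensor groups to `qtype`, return new path."""
    raise NotImplementedError

def cerebellum(base_gguf: str, candidate_groups: list[str]) -> list[str]:
    base_ppl = measure_ppl(base_gguf)               # Q3_K_M baseline
    shipped = []
    for group in candidate_groups:                  # step 3: demote one group at a time
        trial = demote_group(base_gguf, [group], "Q2_K")
        delta = (measure_ppl(trial) - base_ppl) / base_ppl * 100
        if delta <= PPL_THRESHOLD_PCT:              # step 4: ship only passing groups
            shipped.append(group)
    combined = demote_group(base_gguf, shipped, "Q2_K")
    combined_delta = (measure_ppl(combined) - base_ppl) / base_ppl * 100
    print(f"combined build: +{combined_delta:.2f}% PPL")  # step 5: check compounding
    return shipped
```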

Quantized by deucebucket using Cerebellum methodology.
