Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

# Downloads the GGUF from the Hugging Face Hub on first use and caches it
# locally; extra Llama() kwargs such as n_gpu_layers or n_ctx pass through here.
llm = Llama.from_pretrained(
	repo_id="deucebucket/Gemma-4-26B-A4B-it-Cerebellum-v6-GGUF",
	filename="gemma-4-26B-A4B-it-cerebellum-v6.gguf",
)

llm.create_chat_completion(
	messages=[
		{
			"role": "user",
			"content": "What is the capital of France?",
		}
	]
)

Gemma 4 26B-A4B-it — Cerebellum v6 GGUF

Numbers under audit (2026-05-08) — an internal review found the v6 benchmark numbers below need to be re-measured against the same protocol used for v1–v4. A clean re-run with audited wrong-answers and per-question JSONLs is underway. The GGUF file itself is unchanged — this is a measurement issue, not a model issue. Treat the table as preliminary until corrected numbers replace it.

Cerebellum v6 is an ablation-guided mixed-precision GGUF quantization of google/gemma-4-26B-A4B-it.

This is a 26B-parameter MoE model with 4B active parameters per token, 128 experts per layer, and 30 layers. This release uses tensor-level precision overrides selected from 140+ ablation experiments across six internal iterations, including per-layer MoE router surgery.

At a Glance

File          | gemma-4-26B-A4B-it-cerebellum-v6.gguf
Size          | 11.7 GB
Base model    | google/gemma-4-26B-A4B-it
Base quant    | Q3_K_M with bartowski's imatrix
Format        | GGUF, mixed precision
Test hardware | RTX 3090, llama.cpp

Benchmarks

Benchmark        | Result
WikiText PPL     | 12,054
HumanEval pass@1 | 72.0%
ARC-Challenge    | 95.6%
HellaSwag        | 84.7%
MMLU-Redux       | 71.2%

All results measured locally on an RTX 3090 with llama.cpp. PPL was measured on the WikiText-2 test set with 2048 context and 128 chunks.

PPL is high in absolute terms for this model. This appears consistent across Gemma 4 26B quant levels tested locally and may reflect the model's MoE routing behavior on WikiText specifically.
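
For reference, that protocol maps onto llama.cpp's perplexity tool roughly as follows; the binary and dataset paths are assumptions, not taken from this card:

# WikiText-2 PPL: 2048 context, 128 chunks, full GPU offload
./llama-perplexity -m gemma-4-26B-A4B-it-cerebellum-v6.gguf \
	-f wikitext-2-raw/wiki.test.raw -c 2048 --chunks 128 -ngl 99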

What Changed: v1 Through v6

Each version added a new layer of ablation data. The method is always the same: change one thing, measure PPL, keep it only if it helps.

Version | PPL    | HumanEval | What Changed
v1      | 20,614 | 65.2%     | Group-level ablation: 5 tensor groups tested at Q2_K
v2      | 19,826 | 65.9%     | + attn_q per-layer ablation (30 layers tested, 9 promoted to Q5_K)
v3      | 19,826 | 67.1%     | + PLE protection (norms/scales forced to F32)
v4      | 12,614 | 69.5%     | + ffn_up per-layer ablation + precision rebalance
v5      | 12,356 | 71.3%     | + attn_k reverse ablation (30 layers tested, 7 promoted to Q3_K)
v6      | 12,054 | 72.0%     | + MoE router surgery: layer 8 ffn_gate_inp F32→Q8_0

How Cerebellum Works

Cerebellum assigns quantization precision per tensor based on measured impact. Each tensor group and individual layer is tested by changing its precision and measuring perplexity. Only changes that improve or maintain quality are kept.
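
The ablation harness itself isn't published with this card; as a sketch of the decision loop it describes, with quantize_with_overrides and measure_ppl as hypothetical stand-ins for llama-quantize and llama-perplexity runs:

# Sketch of the one-change-at-a-time ablation loop described above.
# quantize_with_overrides() and measure_ppl() are hypothetical helpers
# standing in for llama-quantize / llama-perplexity invocations.

def ablate(candidates, baseline_ppl, quantize_with_overrides, measure_ppl):
    """Greedily keep only precision changes that improve or hold PPL.

    candidates: list of (tensor_pattern, precision) trials, e.g.
    ("ffn_up", "Q2_K") at group level, or ("blk.23.attn_k.weight", "Q3_K")
    when a group with mixed results is escalated to per-layer testing.
    """
    kept = {}
    for pattern, precision in candidates:
        trial = {**kept, pattern: precision}      # change exactly one thing
        ppl = measure_ppl(quantize_with_overrides(trial))
        if ppl <= baseline_ppl:                   # improved or maintained: keep
            kept, baseline_ppl = trial, ppl
    return kept, baseline_ppl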

Group Ablation

Each tensor category was tested at Q2_K and measured by PPL impact:

Group          | Tensors | PPL Delta | Action
attn_q         | 30      | +13.4%    | Per-layer testing (9 layers need Q5_K)
ffn_gate       | 30      | -1.2%     | Left at Q3_K
expert_gate_up | 30      | -5.5%     | Set to Q2_K
attn_k         | 30      | -12.1%    | Per-layer testing (7 layers benefit from Q3_K)
ffn_up         | 30      | -18.2%    | Set to Q2_K

Three of the five tested groups had meaningfully lower PPL at Q2_K (ffn_gate's -1.2% was treated as neutral and left at Q3_K): Q3_K_M was spending bits on tensors that don't need them.

Layer Ablation

Groups with mixed results were tested per layer:

  • attn_q: All 30 layers tested individually at Q2_K. 9 layers exceeded the sensitivity threshold and stay at Q5_K. The other 21 tolerate Q2_K.
  • attn_k: All 30 layers tested individually. 7 layers showed PPL improvement when promoted from Q2_K to Q3_K (layer 23: -3.8%, layer 18: -2.8%). 4 layers (5, 11, 16, 29) were confirmed better at Q2_K.

MoE Router Surgery (New in v6)

llama-quantize ignores --tensor-type-file overrides for ffn_gate_inp.weight (MoE router) tensors. We built gguf_tensor_surgery.py to recast individual tensors directly in the GGUF file.
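
gguf_tensor_surgery.py isn't reproduced in this card; as a rough illustration of the core recast step, here is a minimal sketch using the gguf Python package that ships with llama.cpp. The input filename is hypothetical (the released v6 file already carries the recast), and writing the result back is only sketched in comments:

# Illustrative only; not the actual gguf_tensor_surgery.py.
import numpy as np
from gguf import GGUFReader
from gguf.constants import GGMLQuantizationType
from gguf.quants import quantize, dequantize

TARGET = "blk.8.ffn_gate_inp.weight"   # layer 8 MoE router

# Hypothetical pre-surgery file (the published v6 is already recast).
reader = GGUFReader("gemma-4-26B-A4B-it-cerebellum-v5.gguf")
tensor = next(t for t in reader.tensors if t.name == TARGET)

# Round-trip the F32 router weights through Q8_0 (32-element blocks,
# one fp16 scale each) to inspect what the recast does to them.
f32 = np.asarray(tensor.data, dtype=np.float32).reshape(-1)
q8 = quantize(f32, GGMLQuantizationType.Q8_0)
restored = dequantize(q8, GGMLQuantizationType.Q8_0)
print(f"{TARGET}: max |error| = {np.abs(f32 - restored).max():.6f}")

# Writing the recast tensor back means rebuilding the file with GGUFWriter,
# copying the metadata and every other tensor unchanged; omitted here.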

All 30 router layers were tested individually at Q8_0 (F32→Q8_0):

Layer | PPL    | Delta | Category
8     | 12,054 | -2.4% | Best universal candidate
10    | 11,872 | -3.9% | Best PPL but regresses HumanEval (-9.7%)
6     | 11,988 | -3.0% | Win (not stacked; routing compensation)
9     | 12,044 | -2.5% | Win (not stacked)
12    | 12,041 | -2.5% | Win (not stacked)
23    | 12,052 | -2.5% | Win (not stacked)
0     | 12,974 | +5.0% | Sensitive
1     | 13,525 | +9.5% | Very sensitive
2     | 13,239 | +7.1% | Sensitive
4     | 13,047 | +5.6% | Sensitive

Why layer 8 and not layer 10: Layer 10 had the best PPL improvement (-3.9%), but full HumanEval testing showed it regresses code generation from 71.3% to 61.6%. Layer 10's router controls routing to code-relevant experts — degrading it hurts coding while helping general perplexity. Layer 8 improves PPL (-2.4%) AND HumanEval (+0.7%) with no regressions on any benchmark.

Router stacking doesn't work: Combined demotion of even the top 3 layers worsens PPL vs baseline. The model compensates for one degraded router but not multiple simultaneously. This is a routing compensation effect specific to MoE architectures.

Precision curve for layer 8's router:

Precision     | PPL    | Delta
F32 (default) | 12,356 | baseline
Q8_0          | 12,054 | -2.4%
Q4_0          | 12,355 | ~0%
Q6_K          | 14,317 | +15.9%
Q2_K          | 14,482 | +17.2%

Q8_0 is the only precision that improves PPL. K-quant formats (Q6_K, Q2_K) use 256-element super-blocks with sub-block scales — this structure disrupts the router's fine-grained expert selection. Q8_0's simpler per-block rounding acts as beneficial regularization.
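
For concreteness, the block layouts behind that argument can be summarized in a few lines. The element and byte counts below follow ggml's block definitions in llama.cpp; the snippet just turns them into bits-per-weight as a sanity check:

# Elements and bytes per block for the formats discussed above,
# per the block structs in llama.cpp's ggml quant definitions.
BLOCKS = {
    # name: (elements/block, bytes/block)
    "Q8_0": (32,  2 + 32),             # fp16 scale + 32 int8 values
    "Q4_0": (32,  2 + 16),             # fp16 scale + 32 packed 4-bit values
    "Q6_K": (256, 128 + 64 + 16 + 2),  # 6-bit values, 16 sub-block scales, fp16 super-scale
    "Q2_K": (256, 64 + 16 + 2 + 2),    # 2-bit values, packed sub-block scales/mins
}
for name, (elems, nbytes) in BLOCKS.items():
    print(f"{name}: {elems:3d} elems/block, {nbytes:3d} B, {8 * nbytes / elems:.3f} bpw")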

Final Precision Map (v6)

Tensor Type                         | Precision | Count | Rationale
attn_q (9 sensitive layers)         | Q5_K      | 9     | Layer-validated critical
attn_q (remaining)                  | Q2_K      | 21    | Group-level demotable
attn_k (7 promoted layers)          | Q3_K      | 7     | Reverse ablation: improve when promoted
attn_k (remaining)                  | Q2_K      | 23    | Group-level demotable
ffn_up                              | Q2_K      | 30    | Group PPL delta: -18.2%
expert_gate_up                      | Q2_K      | 30    | Group PPL delta: -5.5%
ffn_gate                            | Q3_K      | 30    | Tolerant (-1.2%)
ffn_gate_inp layer 8 (router)       | Q8_0      | 1     | Per-layer surgery: -2.4% PPL, +0.7% HumanEval
ffn_gate_inp (router, other layers) | F32       | 29    | Group PPL delta: +30.7% when demoted
Norms, scales                       | F32       | 392   | Structural: always full precision

91 tensor-level overrides + 1 surgical router recast on top of Q3_K_M base.
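
As a quick check of the resulting precision mix, the gguf Python package can count tensors per quantization type; this is an inspection sketch, not part of the original tooling:

# Count tensors per quantization type in the released file.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("gemma-4-26B-A4B-it-cerebellum-v6.gguf")
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(qtype, n)
# blk.8.ffn_gate_inp.weight should report Q8_0; the other 29 routers
# and the 392 norm/scale tensors should report F32.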

Usage

# llama.cpp
./llama-server -m gemma-4-26B-A4B-it-cerebellum-v6.gguf -ngl 99 -c 4096

# ollama
ollama create gemma4-cerebellum -f Modelfile
ollama run gemma4-cerebellum
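
The ollama command assumes a Modelfile next to the GGUF; the card doesn't ship one, but a minimal sketch (chat template and sampling left at ollama defaults) would be:

# Modelfile (minimal sketch)
FROM ./gemma-4-26B-A4B-it-cerebellum-v6.gguf
PARAMETER num_ctx 4096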

Fits in 24 GB VRAM at full GPU offload with room for 4K context.

Technical Details

  • Architecture: Gemma 4 26B — 26B total params, 4B active per token, 128 experts/layer, 30 layers
  • Base quant: Q3_K_M with bartowski imatrix
  • Ablation experiments: 140+ across 6 iterations (including 30-layer router surgery)
  • Quantizer: llama.cpp llama-quantize with --tensor-type-file overrides + gguf_tensor_surgery.py for router recast
  • Hardware: RTX 3090 (24 GB VRAM)

Credits

bartowski for the imatrix used in the Q3_K_M base quant; Google for the base model google/gemma-4-26B-A4B-it.
