Gemma 4 E2B — Cerebellum v2 GGUF (3.0 GB)
Ablation-informed mixed-precision quantization of google/gemma-4-e2b-it. 3.0 GB file size, 139.69 perplexity — smaller than stock Q3_K_M (3.06 GB) with 24% lower perplexity.
Three ffn_gate layers identified by per-layer ablation as actively benefiting from Q2_K demotion. Not a blanket crush: surgical precision informed by a 35-layer sensitivity sweep.
Benchmarks
| Benchmark | Cerebellum v2 | Q3_K_M Baseline | Delta |
|---|---|---|---|
| Perplexity (WikiText-2, 2048 ctx) | 139.69 | 184.93 | -24.4% |
| HumanEval pass@1 | 46.3% | 46.3% | 0.0% |
| ARC-Challenge | 71.9% | 71.9% | 0.0% |
| HellaSwag | 50.0% | 50.0% | 0.0% |
| MMLU-Redux | 47.4% | 47.6% | -0.2% |
All benchmarks were measured directly on this file. Benchmark performance is effectively identical (MMLU-Redux within 0.2 points) at a smaller size with significantly lower perplexity.
Why This Works
Standard quantization treats all layers identically. Cerebellum runs a per-layer ablation sweep — testing each layer's ffn_gate individually at Q2_K — and discovers that certain mid-network layers actually produce lower perplexity when crushed harder. This is a regularization effect: the gate tensors at layers 11, 13, and 14 (31-40% depth) carry redundant precision that creates noise at Q3_K_M.
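A minimal sketch of such a sweep, assuming the same llama.cpp tooling used in the Reproducing section (llama-quantize with the --tensor-type-file override mechanism, plus llama-perplexity) and a local WikiText-2 test file; file names here are illustrative:

```bash
#!/usr/bin/env bash
# Hypothetical per-layer sweep: demote one ffn_gate tensor at a time to Q2_K,
# requantize, and measure perplexity so each layer's sensitivity can be ranked.
BASE=google_gemma-4-E2B-it-Q3_K_M.gguf
WIKI=wiki.test.raw   # WikiText-2 test set, assumed to be available locally

for layer in $(seq 0 34); do
  echo "blk.${layer}.ffn_gate.weight=Q2_K" > override_blk${layer}.txt
  llama-quantize --allow-requantize \
    --tensor-type-file override_blk${layer}.txt \
    "$BASE" ablation_blk${layer}.gguf Q3_K_M
  # Record perplexity for this single-layer demotion (2048-token context).
  llama-perplexity -m ablation_blk${layer}.gguf -f "$WIKI" -c 2048 \
    2>&1 | tee ppl_blk${layer}.log
done
```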
The Regularization Effect
When we tested all 35 layers individually:
| Layer | PPL at Q2_K | vs Baseline (184.93) | Effect |
|---|---|---|---|
| blk.11 | 169.18 | -8.5% | Regularization |
| blk.13 | 170.71 | -7.7% | Regularization |
| blk.14 | 172.70 | -6.6% | Regularization |
| blk.12 | 176.27 | -4.7% | Mild benefit |
| blk.0 | 169.21 | -8.5% | Regularization |
| blk.30+ | >200 | >+8% | Damage |
Layers 11, 13, 14 form a cluster in the mid-network where gate tensor precision actively hurts. Combining all three gives PPL 139.69 — the effects stack.
v1 vs v2: Proof That Precision Matters
| Version | Method | PPL | HumanEval | ARC | HellaSwag | MMLU |
|---|---|---|---|---|---|---|
| Baseline | Stock Q3_K_M | 184.93 | 46.3% | 71.9% | 50.0% | 47.6% |
| v1 | All 35 ffn_gate → Q2_K | 139.34 | 17.7% | 64.9% | 39.9% | 43.6% |
| v2 | 3 layers → Q2_K | 139.69 | 46.3% | 71.9% | 50.0% | 47.4% |
v1 proved that blanket-crushing all ffn_gate tensors improves PPL but destroys benchmarks: interaction effects between the simultaneously crushed tensors compound into cascading damage. v2 shows that surgical, ablation-guided demotion captures the same PPL improvement with essentially no benchmark loss (only MMLU-Redux moves, by 0.2 points). The difference between the two recipes is sketched below.
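In override-file terms the contrast looks like this (the v1 file name is illustrative; the v2 file appears in full further down):

```bash
# v1: blanket crush -- demote every ffn_gate tensor (all 35 layers) to Q2_K
for layer in $(seq 0 34); do
  echo "blk.${layer}.ffn_gate.weight=Q2_K"
done > cerebellum_v1_overrides.txt

# v2: only the three ablation-selected layers (11, 13, 14) are demoted
printf 'blk.%d.ffn_gate.weight=Q2_K\n' 11 13 14 > cerebellum_v2_overrides.txt
```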
Architecture Family Recipe Transfer
This model validates that Cerebellum recipes transfer within architecture families. On Gemma 4 E4B (42 layers), the sweet spot for ffn_gate demotion is layers 14-17 (33-40% depth). On E2B (35 layers), it's layers 11-14 (31-40% depth). Same proportional position, same effect.
This means ablation results on one model in a family can inform the starting configuration for smaller/larger siblings — reducing the search space from O(layers) to O(1) confirmation.
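A back-of-the-envelope sketch of that proportional mapping (plain shell arithmetic, not part of any tool; layer counts are taken from this card):

```bash
# Project the E4B sweet spot (blk.14-17 of 42 layers) onto E2B's 35 layers
# by keeping the same fractional depth.
e4b_layers=42; e2b_layers=35
for l in 14 15 16 17; do
  echo "E4B blk.${l} -> E2B blk.$(( l * e2b_layers / e4b_layers ))"
done
# Prints blk.11 through blk.14 -- the same band found by the E2B ablation sweep.
```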
The Override File
```text
blk.11.ffn_gate.weight=Q2_K
blk.13.ffn_gate.weight=Q2_K
blk.14.ffn_gate.weight=Q2_K
```
Three lines. That's the entire recipe.
VRAM Requirements
| Context | VRAM |
|---|---|
| 2K | ~4 GB |
| 8K | ~5 GB |
| 16K | ~6 GB |
Fits on a 4 GB GPU at short context. Ideal for edge deployment.
Usage
```bash
# llama.cpp
llama-server \
  --model Gemma-4-E2B-it-Cerebellum-v2.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192
```

```bash
# Ollama
echo 'FROM ./Gemma-4-E2B-it-Cerebellum-v2.gguf' > Modelfile
ollama create gemma4-e2b -f Modelfile
ollama run gemma4-e2b
```
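Once llama-server is running, a quick smoke test against its OpenAI-compatible endpoint (a sketch assuming the default bind address of 127.0.0.1:8080):

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Explain what a GGUF file is in one sentence."}
        ],
        "max_tokens": 64
      }'
```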
Reproducing This Quant
```bash
# 1. Get the Q3_K_M baseline (or quantize from BF16)

# 2. Create the override file:
echo "blk.11.ffn_gate.weight=Q2_K
blk.13.ffn_gate.weight=Q2_K
blk.14.ffn_gate.weight=Q2_K" > cerebellum_v2_overrides.txt

# 3. Requantize with overrides
llama-quantize \
  --allow-requantize \
  --tensor-type-file cerebellum_v2_overrides.txt \
  google_gemma-4-E2B-it-Q3_K_M.gguf \
  Gemma-4-E2B-it-Cerebellum-v2.gguf Q3_K_M
```
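To sanity-check the result against the numbers reported above, a sketch using llama-perplexity (assumes a local WikiText-2 test file, here named wiki.test.raw):

```bash
# Should land close to the reported 139.69 PPL (WikiText-2, 2048-token context).
llama-perplexity \
  -m Gemma-4-E2B-it-Cerebellum-v2.gguf \
  -f wiki.test.raw \
  -c 2048 \
  --n-gpu-layers 99
```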
Files
| File | Size | Description |
|---|---|---|
| Gemma-4-E2B-it-Cerebellum-v2.gguf | 3.0 GB | The quantized model |
| cerebellum_v2_overrides.txt | 87 B | 3 tensor type overrides |
Model Details
- Base model: google/gemma-4-e2b-it
- Architecture: Dense transformer with PLE, 35 layers, 608 tensors
- Quantization: Q3_K_M base with 3 ffn_gate tensors demoted to Q2_K
- Method: Per-layer ablation sweep identifying regularization candidates
- Vocabulary: 262,144 tokens (text + vision + audio)
- File format: GGUF v3
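To verify which tensors were demoted, the gguf-dump utility from the gguf Python package can list per-tensor quantization types (a sketch; exact output format varies by version):

```bash
pip install gguf   # provides the gguf-dump console script
gguf-dump Gemma-4-E2B-it-Cerebellum-v2.gguf | grep ffn_gate
# blk.11, blk.13 and blk.14 ffn_gate.weight should show Q2_K;
# every other ffn_gate tensor should remain at the stock Q3_K_M mix.
```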
Test Hardware
| Component | Spec |
|---|---|
| GPU | NVIDIA RTX 3090 (24 GB) |
| CPU | AMD Ryzen 7 5800XT |
| RAM | 64 GB DDR4 |
| OS | Fedora Linux 43 (Atomic) |
Attribution
- Google DeepMind — Gemma 4 base model
- llama.cpp — quantization and tensor type override support