Gemma 4 E2B — Cerebellum v2 GGUF (3.0 GB)

An ablation-informed mixed-precision quantization of google/gemma-4-e2b-it: 3.0 GB on disk at 139.69 perplexity, smaller than stock Q3_K_M (3.06 GB) with 24.4% lower perplexity.

Three ffn_gate layers were identified by per-layer ablation as actively benefiting from Q2_K demotion. Not a blanket crush: surgical precision, informed by a 35-layer sensitivity sweep.

Benchmarks

| Benchmark | Cerebellum v2 | Q3_K_M Baseline | Delta |
|---|---|---|---|
| Perplexity (WikiText-2, 2048 ctx) | 139.69 | 184.93 | -24.4% |
| HumanEval pass@1 | 46.3% | 46.3% | 0.0% |
| ARC-Challenge | 71.9% | 71.9% | 0.0% |
| HellaSwag | 50.0% | 50.0% | 0.0% |
| MMLU-Redux | 47.4% | 47.6% | -0.2% |

All benchmarks were measured directly on this file. The result is effectively identical benchmark performance at a smaller size and significantly lower perplexity.

Why This Works

Standard quantization treats all layers identically. Cerebellum runs a per-layer ablation sweep — testing each layer's ffn_gate individually at Q2_K — and discovers that certain mid-network layers actually produce lower perplexity when crushed harder. This is a regularization effect: the gate tensors at layers 11, 13, and 14 (31-40% depth) carry redundant precision that creates noise at Q3_K_M.
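
The sweep itself is mechanical. A minimal sketch in shell, reusing the override-file workflow from Reproducing This Quant below (file names are placeholders):

# Ablate each layer's ffn_gate individually at Q2_K and measure PPL.
BASE=google_gemma-4-E2B-it-Q3_K_M.gguf
for i in $(seq 0 34); do
  echo "blk.$i.ffn_gate.weight=Q2_K" > override.txt
  llama-quantize --allow-requantize --tensor-type-file override.txt \
    "$BASE" "ablate_$i.gguf" Q3_K_M
  llama-perplexity -m "ablate_$i.gguf" -f wiki.test.raw -c 2048 --n-gpu-layers 99
done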

The Regularization Effect

When we tested all 35 layers individually:

| Layer | PPL at Q2_K | vs Baseline (184.93) | Effect |
|---|---|---|---|
| blk.11 | 169.18 | -8.5% | Regularization |
| blk.13 | 170.71 | -7.7% | Regularization |
| blk.14 | 172.70 | -6.6% | Regularization |
| blk.12 | 176.27 | -4.7% | Mild benefit |
| blk.0 | 169.21 | -8.5% | Regularization |
| blk.30+ | 200+ | +8%+ | Damage |

Layers 11, 13, 14 form a cluster in the mid-network where gate tensor precision actively hurts. Combining all three gives PPL 139.69 — the effects stack.
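
As a rough sanity check on "the effects stack": if the three individual improvements composed independently (multiplicatively), the expected combined PPL would be about 184.93 × 0.915 × 0.923 × 0.934 ≈ 146. The measured 139.69 beats that, so the cluster stacks slightly better than independent effects would predict.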

v1 vs v2: Proof That Precision Matters

| Version | Method | PPL | HumanEval | ARC | HellaSwag | MMLU |
|---|---|---|---|---|---|---|
| Baseline | Stock Q3_K_M | 184.93 | 46.3% | 71.9% | 50.0% | 47.6% |
| v1 | All 35 ffn_gate → Q2_K | 139.34 | 17.7% | 64.9% | 39.9% | 43.6% |
| v2 | 3 layers → Q2_K | 139.69 | 46.3% | 71.9% | 50.0% | 47.4% |

v1 proved that blanket-crushing all 35 ffn_gate tensors improves PPL but destroys benchmarks: interaction effects between the simultaneously crushed tensors cause cascading damage. v2 proves that surgical, ablation-guided demotion captures the same PPL improvement with essentially zero benchmark loss.

Architecture Family Recipe Transfer

This model validates that Cerebellum recipes transfer within architecture families. On Gemma 4 E4B (42 layers), the sweet spot for ffn_gate demotion is layers 14-17 (33-40% depth). On E2B (35 layers), it's layers 11-14 (31-40% depth). Same proportional position, same effect.

This means ablation results on one model in a family can inform the starting configuration for smaller/larger siblings — reducing the search space from O(layers) to O(1) confirmation.
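
A sketch of that proportional mapping, as a hypothetical helper only: projecting the E4B sweet spot (layers 14-17 of 42) onto a 35-layer sibling by fractional depth reproduces the E2B candidate cluster, which the ablation then confirms or trims (here, blk.12 was dropped as only a mild benefit).

# Project E4B's sweet spot (layers 14-17 of 42) onto an N-layer
# sibling by fractional depth. Hypothetical helper, not part of
# any llama.cpp tool.
N=35
for l in 14 15 16 17; do
  awk -v l="$l" -v n="$N" 'BEGIN { printf "blk.%d.ffn_gate.weight=Q2_K\n", int(l * n / 42) }'
done
# Prints blk.11 through blk.14; confirm each by ablation.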

The Override File

blk.11.ffn_gate.weight=Q2_K
blk.13.ffn_gate.weight=Q2_K
blk.14.ffn_gate.weight=Q2_K

Three lines. That's the entire recipe.
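
To confirm the demotions actually landed in the output file, inspect the tensor list. A sketch assuming the gguf Python package (pip install gguf), which ships a gguf-dump script:

# blk.11/13/14 ffn_gate tensors should report Q2_K; all others
# keep their stock Q3_K_M types.
gguf-dump Gemma-4-E2B-it-Cerebellum-v2.gguf | grep ffn_gate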

VRAM Requirements

| Context | VRAM |
|---|---|
| 2K | ~4 GB |
| 8K | ~5 GB |
| 16K | ~6 GB |

Fits on a 4 GB GPU at short context. Ideal for edge deployment.
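
For longer contexts on the same small card, quantizing the K cache shrinks its footprint. A hedged variant of the Usage command below, assuming your llama.cpp build supports the --cache-type-k flag (quantizing the V cache as well typically requires flash attention):

llama-server \
  --model Gemma-4-E2B-it-Cerebellum-v2.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --cache-type-k q8_0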

Usage

# llama.cpp
llama-server \
  --model Gemma-4-E2B-it-Cerebellum-v2.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192

# Ollama
echo 'FROM ./Gemma-4-E2B-it-Cerebellum-v2.gguf' > Modelfile
ollama create gemma4-e2b -f Modelfile
ollama run gemma4-e2b

Reproducing This Quant

# 1. Get Q3_K_M baseline (or quantize from BF16)
# 2. Create override file:
echo "blk.11.ffn_gate.weight=Q2_K
blk.13.ffn_gate.weight=Q2_K
blk.14.ffn_gate.weight=Q2_K" > cerebellum_v2_overrides.txt

# 3. Requantize with overrides
llama-quantize \
  --allow-requantize \
  --tensor-type-file cerebellum_v2_overrides.txt \
  google_gemma-4-E2B-it-Q3_K_M.gguf \
  Gemma-4-E2B-it-Cerebellum-v2.gguf Q3_K_M
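
# 4. (Optional) Verify the perplexity figure. A sketch assuming
#    llama.cpp's llama-perplexity tool and a local copy of the
#    WikiText-2 test set (wiki.test.raw):
llama-perplexity \
  -m Gemma-4-E2B-it-Cerebellum-v2.gguf \
  -f wiki.test.raw \
  -c 2048 \
  --n-gpu-layers 99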

Files

| File | Size | Description |
|---|---|---|
| Gemma-4-E2B-it-Cerebellum-v2.gguf | 3.0 GB | The quantized model |
| cerebellum_v2_overrides.txt | 87 B | 3 tensor-type overrides |

Model Details

  • Base model: google/gemma-4-e2b-it
  • Architecture: Dense transformer with per-layer embeddings (PLE), 35 layers, 608 tensors
  • Quantization: Q3_K_M base with 3 ffn_gate tensors demoted to Q2_K
  • Method: Per-layer ablation sweep identifying regularization candidates
  • Vocabulary: 262,144 tokens (text + vision + audio)
  • File format: GGUF v3

Test Hardware

| Component | Spec |
|---|---|
| GPU | NVIDIA RTX 3090 (24 GB) |
| CPU | AMD Ryzen 7 5800XT |
| RAM | 64 GB DDR4 |
| OS | Fedora Linux 43 (Atomic) |

License

Gemma License