Gemma 4 A4B 98-Expert v3 (20.8B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a contribution-weighted importance map aggregated across all task categories (math, logic, code, science, creative).

| Metric | Original (128e) | 109e v3 | This model (98e v3) |
|---|---|---|---|
| Total params | 26B | 22.4B | ~20.8B |
| Experts per layer | 128 | 109 | 98 |
| Experts dropped | – | 19/layer | 30/layer |
| MoE capacity removed | – | 14.8% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| GPQA Diamond (Q6_K) | 75.25% | 71.72% | 75.25% |

Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity). This model matches the full 128-expert reference exactly on GPQA Diamond.

GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF, including standard Bartowski quants plus ContribDynamic (CD) per-layer quants.

Pruning Method

Contribution-Weighted Expert Analysis

The drop map is derived from expert_neuron_v4.json, a comprehensive per-expert contribution analysis across all task categories (math, logic, code, science, creative) using 128-token teacher-force analysis on the full 128-expert reference model.

Process (scripts/expert_drop.py):

  1. Contribution scoring: For each expert in each layer, the total contribution (tc field) is computed as the sum of weighted output norms across all task categories.
  2. Per-layer ranking: Experts are ranked by total contribution within each layer.
  3. Drop decision: The 30 lowest-contributing experts per layer are dropped (128 → 98).
  4. Router resize: The MoE router proj.weight is resized from [128, hidden] to [98, hidden], keeping only rows for retained experts. The top-8 routing naturally adapts (see the sketch after this list).
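
A minimal sketch of steps 1-4, assuming a per-layer score layout for expert_neuron_v4.json and hypothetical tensor names; the deterministic implementation is scripts/expert_drop.py:

import json
import torch

NUM_KEEP = 98  # 128 experts minus the 30 lowest-contributing per layer

# Assumed layout: one list of total-contribution ("tc") scores per layer;
# the exact field structure lives in expert_neuron_v4.json.
with open("expert_neuron_v4.json") as f:
    contrib = json.load(f)

def retained_experts(layer_scores, num_keep=NUM_KEEP):
    # Rank experts by total contribution, keep the top num_keep,
    # then sort back to original indices so tensor slicing stays ordered.
    ranked = sorted(range(len(layer_scores)), key=lambda e: layer_scores[e], reverse=True)
    return sorted(ranked[:num_keep])

def prune_router(router_weight: torch.Tensor, keep: list[int]) -> torch.Tensor:
    # proj.weight [128, hidden] -> [98, hidden]: keep only retained rows;
    # top-8 routing then operates over the 98 surviving logits unchanged.
    return router_weight[keep, :].clone()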

Why 98e Works Better Than 109e

The 98e v3 model uses a different importance map than 109e v3:

  • 109e v3: Uses a per-question top-16 protection scheme on GPQA Diamond specifically (teacher-force analysis, 196 questions). This overfits the drop map to GPQA.
  • 98e v3: Uses expert_neuron_v4.json which aggregates contribution across all task categories with 128-token analysis windows. This produces a more generalizable importance ranking.

The result: 98e v3 matches the 128e reference (75.25%) while dropping 30 experts, whereas 109e v3 drops only 19 experts but loses 3.53 pp. The broader importance map makes better pruning decisions.

Key Findings

  • Experts are NOT topic-specialized: Top-32 overlap is 28/32 between math and creative domains. The same experts are important across tasks.
  • Contribution is moderately concentrated: Gini coefficient ~0.38. You need ~75 experts per layer for 80% of the contribution; the bottom 30 carry very little signal (see the sketch after this list).
  • Expert weight similarity is near zero: Max cosine similarity ~0.05 between expert weight matrices. Merging experts by averaging destroys the model. Expert dropping is the only viable structural compression.
  • Early layers matter most: Layer 0 has the highest importance (1.0), layers 28-29 the lowest (~0.04-0.05). The drop is nonetheless applied uniformly across layers.
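
For reference, a small sketch of how the concentration numbers can be reproduced from a layer's vector of tc scores (hypothetical input shape; this is not the original analysis script):

import numpy as np

def gini(scores):
    # Gini coefficient of non-negative contributions
    # (0 = perfectly uniform, 1 = a single expert dominates)
    x = np.sort(np.asarray(scores, dtype=np.float64))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

def experts_for_coverage(scores, coverage=0.80):
    # Smallest top-k whose contributions reach `coverage` of the layer total
    x = np.sort(np.asarray(scores, dtype=np.float64))[::-1]
    frac = np.cumsum(x) / x.sum()
    return int(np.searchsorted(frac, coverage) + 1)

# Expected on this model's maps: gini(tc) ~ 0.38, experts_for_coverage(tc) ~ 75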

GPQA Diamond Evaluation

Setup (identical methodology to 109e v3)

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • GPU: NVIDIA RTX 3090 (24 GB)

Configuration

| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek |
| Reasoning budget | 8192 tokens |
| Temperature | 1.0 (Gemma 4 official) |
| top_p | 0.95 |
| top_k | 64 |
| DRY multiplier | 0.5 |
| Tokenizer | google/gemma-4-26B-A4B-it (original) |
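
A hedged sketch of the harness invocation against the llama.cpp server (endpoint, port, and output path are assumed values; exact flags may differ across lm-evaluation-harness versions):

lm_eval --model local-chat-completions \
  --model_args model=gemma-4-A4B-98e-v3-it,base_url=http://127.0.0.1:8099/v1/chat/completions \
  --tasks gpqa_diamond_cot_zeroshot \
  --apply_chat_template --log_samples \
  --output_path results/gpqa_98e_v3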

Results

| Model | Experts/Layer | GPQA Diamond (flex) | Delta vs 128e |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 128 | 75.25% | – |
| gemma-4-A4B-98e-v3-it (this) | 98 | 75.25% | +0.00 pp |
| gemma-4-A4B-109e-v3-it | 109 | 71.72% | −3.53 pp |
| gemma-4-E4B (small Gemma 4) | – | 57.07% | −18.18 pp |

Code Benchmarks (HumanEval + MBPP)

Code-generation benchmarks were run at BF16 via vLLM (TP=4 on 4× RTX 3090) using a chat-completions pipeline that bypasses the llama.cpp Gemma 4 reasoning-token / shared-KV bugs. Both 128e (reference) and 98e v3 (this model) were evaluated with identical methodology for an apples-to-apples comparison.

Setup

  • Precision: BF16 (no quantization), eliminating Q6_K confounders
  • Inference: vLLM TP=4, /v1/chat/completions endpoint, native chat template
  • Evaluation: lm-evaluation-harness with a custom humaneval_instruct_chat task that overrides the stop sequence to ["\n```"] (closing code fence) so the model's full code body is captured before \ndef, \nclass, etc. trigger early stop
  • MBPP: same vLLM endpoint, default lm-eval mbpp task
  • Decoding: greedy (temperature=0.0, top_p=1.0), max_gen_toks=2048
  • HumanEval rescoring: lm-eval's build_predictions_instruct filter strips at the opening ``` (dropping the body); a local rescore (scripts/rescore_humaneval_strip_fences.py --from-raw) reads resps[0][0], strips both opening and closing fences, and re-runs exec(prompt + completion + check) with a 10s timeout (see the sketch after this list)
  • MBPP rescoring: same fence-strip applied to samples_mbpp.jsonl, since chat-mode also wraps MBPP responses in ```python blocks
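
A minimal sketch of the fence-strip + re-exec rescore, assuming lm-eval's --log_samples JSONL layout; this is an illustration, not the exact scripts/rescore_humaneval_strip_fences.py:

import re
import signal

def strip_fences(completion: str) -> str:
    # Drop an opening ```python (or bare ```) fence and a trailing ``` fence,
    # leaving only the code body between them.
    completion = re.sub(r"^\s*```[a-zA-Z]*\n", "", completion)
    return re.sub(r"\n?```\s*$", "", completion)

def passes(prompt: str, completion: str, check: str, seconds: int = 10) -> bool:
    # Re-run the assembled program; any exception or timeout counts as a failure.
    def _alarm(signum, frame):
        raise TimeoutError
    signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(seconds)
    try:
        exec(prompt + strip_fences(completion) + "\n" + check, {"__name__": "__main__"})
        return True
    except Exception:
        return False
    finally:
        signal.alarm(0)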

Results (BF16, fence-strip rescored)

| Model | HumanEval | MBPP | GPQA Diamond |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 76.83% | 89.60% | 75.25% |
| gemma-4-A4B-98e-v3-it (this) | 73.78% | 85.60% | 75.25% |
| Δ vs 128e | −3.05 pp (−4.0% rel) | −4.00 pp (−4.5% rel) | +0.00 pp |

Why these numbers replace the previous Q6_K table

An earlier version of this card reported HumanEval 10.37% / MBPP 20.60% for 98e v3 under Q6_K + llama.cpp + raw /v1/completions. Those numbers were artifacts of the eval pipeline, not the model:

  • Greedy chat-mode generation under llama.cpp's Gemma 4 path leaks reasoning tokens, hits <unused> floods, and produces markdown ```python fences that lm-eval's exec(prompt + completion + tests) scorer treats as SyntaxError.
  • The 128e reference suffered from the same pipeline (44.40% Q6_K → 89.60% BF16, +45 pp from the methodology fix alone), so the relative Δ between 128e and 98e was inflated by infrastructure noise on top of the actual pruning cost.
  • BF16 + vLLM + chat-completions + the humaneval_instruct_chat stop-sequence override + fence-strip rescore eliminates all of those at once.

The clean reading is what's in the table above: 23.4% expert reduction costs ~4–5% relative on code generation, zero on knowledge recall.

For the original llama.cpp Gemma 4 issues that motivated the BF16 rerun, see the supplementary Q6_K snapshot below.

Architecture

Unchanged from the original except num_experts: 98 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8 (of 98 available)
  • Attention: Hybrid sliding (5) + global (1) pattern, head_dim=512
  • Vocabulary: 262,144
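
A back-of-the-envelope check on the headline parameter count, assuming the standard three projection matrices (gate/up/down) per expert:

hidden, d_expert, n_layers, dropped = 2816, 704, 30, 30
per_expert = 3 * hidden * d_expert         # gate + up + down: ~5.95M params
removed = dropped * n_layers * per_expert  # ~5.35B params across all MoE layers
print(f"{per_expert/1e6:.2f}M per expert, {removed/1e9:.2f}B removed")
# 26B - ~5.35B ≈ 20.65B, consistent with the ~20.8B headline
# (dropped router rows add only 30 * 30 * 2816 ≈ 2.5M, negligible)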

Files

  • config.json – Model config with num_experts: 98
  • model-0000N-of-00009.safetensors – Model weights (bf16)
  • expert_drop.py – Deterministic expert pruning script

Supplementary llama.cpp Q6_K snapshot (2026-05-11) – tainted, do not cite

These numbers are appended for archival reference only and DO NOT supersede the previously published evaluations above. They were collected with a llama.cpp build whose Gemma 4 support is incomplete; multiple known issues (reasoning-token leaks, <unused> flood, fence drift, cache-reuse gaps) are still active in this serving path. The proper way to read this section is "what numbers does the llama.cpp pipeline produce today", not "what is the true capability of the model".

  • llama.cpp tag/build: b9095-2-g0b04728 (build 590), CUDA backend on RTX 3090.
  • Serving profile: --reasoning-format deepseek --reasoning-budget 8192, chat-completions endpoint, q8_0 KV cache, --parallel 2 (see the invocation sketch after this list).
  • Sampler: greedy for HE/MBPP/LCB (--temp 0 --top-p 1 --top-k 0 --seed 42), gemma4 preset for GPQA (T=1.0 / top_p=0.95 / top_k=64).
  • Scoring: lm-evaluation-harness --apply_chat_template, --use_cache, --log_samples, no fence-strip rescore applied here (chat-completions cleanly returns resps).
  • A vLLM 4-bit cross-validation of these same benchmarks is scheduled; if vLLM agrees, the deltas below are model-level; if not, the deltas are pipeline-level and the canonical (published) numbers stand.
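
For reproducibility, the serving profile above maps roughly onto the following llama-server invocation; the KV-cache and parallel flags are inferred from the bullets, not copied from the original run script:

llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --parallel 2 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 0 --top-p 1 --top-k 0 --seed 42   # greedy profile for HE/MBPP/LCB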

Scores

| Bench (n) | 128e Q6_K | 98e-v3 Q6_K | 98e-v4 (cd-max) Q6_K |
|---|---|---|---|
| HumanEval-chat @3072 (164) | 97.56% | 73.78% | 96.34% |
| MBPP-chat (500) | 79.20% ±1.82 | – (no clean chat run) | 76.00% ±1.91 |
| LCB-medium @8k (55) | 87.27% (48/55) | – | 78.18% (43/55) |
| LCB-medium @16k (55) | in flight | – | 78.18% (43/55) |
| GPQA-Diamond (198) | 65.66% ±3.38 (v4-proto) | 75.25% ±3.07 (legacy proto) | 77.27% ±2.99 |

The 128e GPQA at 65.66% under the v4 protocol is well below the canonical 75.25% Q6_K and is the clearest evidence here that the llama.cpp serving path is the failure mode, not the model. The 75.25% legacy number remains the value to cite for 128e. The 98e-v3 row reuses the published GPQA result (run under the legacy llama.cpp protocol) for completeness; it is not re-measured here.

Output tokens per response (samples-derived)

LCB token counts are real (completion_tokens from the patched runner). HE / MBPP / GPQA counts are approximated as len(resps) / 4 (characters / 4) from samples_*.jsonl; this is within ±15% of true Gemma 4 token counts (see the sketch below).
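
A sketch of that approximation, assuming each samples_*.jsonl row carries the response under resps[0][0] as in lm-eval's --log_samples output:

import json
import numpy as np

def token_stats(samples_path):
    # Approximate per-response token counts as character length / 4
    toks = []
    with open(samples_path) as f:
        for line in f:
            resp = json.loads(line)["resps"][0][0]
            toks.append(len(resp) / 4)  # chars/4 ~ Gemma 4 tokens, within ±15%
    t = np.asarray(toks)
    return {"median": np.median(t), "mean": t.mean(),
            "p95": np.percentile(t, 95), "max": t.max(), "total": t.sum()}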

| Bench | Variant | median | mean | p95 | max | total (n × mean) |
|---|---|---|---|---|---|---|
| HE-chat | 128e | 313 | 334 | 681 | 917 | 54.9k |
| HE-chat | 98e-v3 | 490 | 512 | 953 | 1013 | 84.0k |
| HE-chat | 98e-v4 | 303 | 340 | 755 | 895 | 55.9k |
| MBPP-chat | 128e | 194 | 224 | 453 | 532 | 112k |
| MBPP-chat | 98e-v3 | 129 | 356 | 1328 | 1892 | 178k (raw-protocol outliers) |
| MBPP-chat | 98e-v4 | 165 | 206 | 455 | 530 | 103k |
| LCB-med @8k | 128e | 1174 | 2167 | 7949 | 8192 | 119k |
| LCB-med @8k | 98e-v4 | 2829 | 3667 | 8192 | 8192 | 202k |
| LCB-med @16k | 98e-v4 | 2829 | 4913 | 15983 | 16064 | 270k |
| GPQA-D | 128e | 749 | 948 | 2098 | 4576 | 375k |
| GPQA-D | 98e-v3 | 648 | 655 | 815 | 1076 | 250k |
| GPQA-D | 98e-v4 | 676 | 783 | 950 | 5495 | 305k |

Raising the LCB cap from 8k to 16k for 98e-v4 did not change the score (43/55 both times). The cap was binding for 22/55 problems at 8k but every truncated answer was already on a wrong trajectory; the failures are real, not truncation.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended)

llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5

GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF
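
A quick smoke test against the server above (the prompt and port are just the values used elsewhere on this card):

curl http://127.0.0.1:8099/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}],
        "temperature": 1.0, "top_p": 0.95, "max_tokens": 512
      }'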

Related Models

| Model | Description |
|---|---|
| gemma-4-A4B-109e-v3-it | 109 experts (19 dropped), clean teacher-force map |
| gemma-4-A4B-109e-v3-it-GGUF | GGUF quants for 109e v3 |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for this model |

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix GGUF quantization