Gemma 4 A4B 98-Expert v3 (20.8B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a contribution-weighted importance map aggregated across all task categories (math, logic, code, science, creative).

| Metric | Original (128e) | 109e v3 | This model (98e v3) |
|---|---|---|---|
| Total params | 26B | 22.4B | ~20.8B |
| Experts per layer | 128 | 109 | 98 |
| Experts dropped | – | 19/layer | 30/layer |
| MoE capacity removed | – | 14.8% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| GPQA Diamond (Q6_K) | 75.25% | 71.72% | 75.25% |

Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity). This model matches the full 128-expert reference exactly on GPQA Diamond.

GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF, including standard Bartowski quants plus ContribDynamic (CD) per-layer quants.

Pruning Method

Contribution-Weighted Expert Analysis

The drop map is derived from expert_neuron_v4.json, a comprehensive per-expert contribution analysis across all task categories (math, logic, code, science, creative) using 128-token teacher-force analysis on the full 128-expert reference model.

Process (scripts/expert_drop.py):

  1. Contribution scoring: For each expert in each layer, the total contribution (tc field) is computed as the sum of weighted output norms across all task categories.
  2. Per-layer ranking: Experts are ranked by total contribution within each layer.
  3. Drop decision: The 30 lowest-contributing experts per layer are dropped (128 → 98).
  4. Router resize: The MoE router proj.weight is resized from [128, hidden] to [98, hidden], keeping only rows for retained experts. The top-8 routing naturally adapts (see the sketch after this list).
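
A minimal sketch of steps 1-4, assuming a per-layer score layout for expert_neuron_v4.json and hypothetical tensor names; the deterministic implementation is scripts/expert_drop.py:

import json
import torch

NUM_KEEP = 98  # 128 experts minus the 30 lowest-contributing per layer

# Assumed layout: one list of total-contribution ("tc") scores per layer;
# the exact field structure lives in expert_neuron_v4.json.
with open("expert_neuron_v4.json") as f:
    contrib = json.load(f)

def retained_experts(layer_scores, num_keep=NUM_KEEP):
    # Rank experts by total contribution, keep the top num_keep,
    # then sort back to original indices so tensor slicing stays ordered.
    ranked = sorted(range(len(layer_scores)), key=lambda e: layer_scores[e], reverse=True)
    return sorted(ranked[:num_keep])

def prune_router(router_weight: torch.Tensor, keep: list[int]) -> torch.Tensor:
    # proj.weight [128, hidden] -> [98, hidden]: keep only retained rows;
    # top-8 routing then operates over the 98 surviving logits unchanged.
    return router_weight[keep, :].clone()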

Why 98e Works Better Than 109e

The 98e v3 model uses a different importance map than 109e v3:

  • 109e v3: Uses a per-question top-16 protection scheme on GPQA Diamond specifically (teacher-force analysis, 196 questions). This overfits the drop map to GPQA.
  • 98e v3: Uses expert_neuron_v4.json which aggregates contribution across all task categories with 128-token analysis windows. This produces a more generalizable importance ranking.

The result: 98e v3 matches the 128e reference (75.25%) while dropping 30 experts, whereas 109e v3 drops only 19 experts but loses 3.53 pp. The broader importance map makes better pruning decisions.

Key Findings

  • Experts are NOT topic-specialized: Top-32 overlap is 28/32 between math and creative domains. The same experts are important across tasks.
  • Contribution is moderately concentrated: Gini coefficient ~0.38. You need ~75 experts per layer for 80% of the contribution; the bottom 30 carry very little signal (see the sketch after this list).
  • Expert weight similarity is near zero: Max cosine similarity ~0.05 between expert weight matrices. Merging experts by averaging destroys the model. Expert dropping is the only viable structural compression.
  • Early layers matter most: Layer 0 has the highest importance (1.0), layers 28-29 the lowest (~0.04-0.05). The drop is nonetheless applied uniformly across layers.
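
For reference, a small sketch of how the concentration numbers can be reproduced from a layer's vector of tc scores (hypothetical input shape; this is not the original analysis script):

import numpy as np

def gini(scores):
    # Gini coefficient of non-negative contributions
    # (0 = perfectly uniform, 1 = a single expert dominates)
    x = np.sort(np.asarray(scores, dtype=np.float64))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

def experts_for_coverage(scores, coverage=0.80):
    # Smallest top-k whose contributions reach `coverage` of the layer total
    x = np.sort(np.asarray(scores, dtype=np.float64))[::-1]
    frac = np.cumsum(x) / x.sum()
    return int(np.searchsorted(frac, coverage) + 1)

# Expected on this model's maps: gini(tc) ~ 0.38, experts_for_coverage(tc) ~ 75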

GPQA Diamond Evaluation

Setup (identical methodology to 109e v3)

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • GPU: NVIDIA RTX 3090 (24 GB)

Configuration

| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek |
| Reasoning budget | 8192 tokens |
| Temperature | 1.0 (Gemma 4 official) |
| top_p | 0.95 |
| top_k | 64 |
| DRY multiplier | 0.5 |
| Tokenizer | google/gemma-4-26B-A4B-it (original) |
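
A hedged sketch of the harness invocation against the llama.cpp server (endpoint, port, and output path are assumed values; exact flags may differ across lm-evaluation-harness versions):

lm_eval --model local-chat-completions \
  --model_args model=gemma-4-A4B-98e-v3-it,base_url=http://127.0.0.1:8099/v1/chat/completions \
  --tasks gpqa_diamond_cot_zeroshot \
  --apply_chat_template --log_samples \
  --output_path results/gpqa_98e_v3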

Results

| Model | Experts/Layer | GPQA Diamond (flex) | Delta vs 128e |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 128 | 75.25% | – |
| gemma-4-A4B-98e-v3-it (this) | 98 | 75.25% | +0.00 pp |
| gemma-4-A4B-109e-v3-it | 109 | 71.72% | −3.53 pp |
| gemma-4-E4B (small Gemma 4) | – | 57.07% | −18.18 pp |

Code Benchmarks (HumanEval + MBPP)

Code-generation benchmarks were run at BF16 via vLLM (TP=4 on 4× RTX 3090) using a chat-completions pipeline that bypasses the llama.cpp Gemma 4 reasoning-token / shared-KV bugs. Both 128e (reference) and 98e v3 (this model) were evaluated with identical methodology for an apples-to-apples comparison.

Setup

  • Precision: BF16 (no quantization), eliminating Q6_K confounders
  • Inference: vLLM TP=4, /v1/chat/completions endpoint, native chat template
  • Evaluation: lm-evaluation-harness with a custom humaneval_instruct_chat task that overrides the stop sequence to ["\n```"] (closing code fence) so the model's full code body is captured before \ndef, \nclass, etc. trigger early stop
  • MBPP: same vLLM endpoint, default lm-eval mbpp task
  • Decoding: greedy (temperature=0.0, top_p=1.0), max_gen_toks=2048
  • HumanEval rescoring: lm-eval's build_predictions_instruct filter strips at the opening ``` (dropping the body); a local rescore (scripts/rescore_humaneval_strip_fences.py --from-raw) reads resps[0][0], strips both opening and closing fences, and re-runs exec(prompt + completion + check) with a 10s timeout (see the sketch after this list)
  • MBPP rescoring: same fence-strip applied to samples_mbpp.jsonl, since chat-mode also wraps MBPP responses in ```python blocks
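
A minimal sketch of the fence-strip + re-exec rescore, assuming lm-eval's --log_samples JSONL layout; this is an illustration, not the exact scripts/rescore_humaneval_strip_fences.py:

import re
import signal

def strip_fences(completion: str) -> str:
    # Drop an opening ```python (or bare ```) fence and a trailing ``` fence,
    # leaving only the code body between them.
    completion = re.sub(r"^\s*```[a-zA-Z]*\n", "", completion)
    return re.sub(r"\n?```\s*$", "", completion)

def passes(prompt: str, completion: str, check: str, seconds: int = 10) -> bool:
    # Re-run the assembled program; any exception or timeout counts as a failure.
    def _alarm(signum, frame):
        raise TimeoutError
    signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(seconds)
    try:
        exec(prompt + strip_fences(completion) + "\n" + check, {"__name__": "__main__"})
        return True
    except Exception:
        return False
    finally:
        signal.alarm(0)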

Results (BF16, fence-strip rescored)

| Model | HumanEval | MBPP | GPQA Diamond |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 76.83% | 89.60% | 75.25% |
| gemma-4-A4B-98e-v3-it (this) | 73.78% | 85.60% | 75.25% |
| Δ vs 128e | −3.05 pp (−4.0% rel) | −4.00 pp (−4.5% rel) | +0.00 pp |

Why these numbers replace the previous Q6_K table

An earlier version of this card reported HumanEval 10.37% / MBPP 20.60% for 98e v3 under Q6_K + llama.cpp + raw /v1/completions. Those numbers were artifacts of the eval pipeline, not the model:

  • Greedy chat-mode generation under llama.cpp's Gemma 4 path leaks reasoning tokens, hits <unused> floods, and produces markdown ```python fences that lm-eval's exec(prompt + completion + tests) scorer treats as SyntaxError.
  • The 128e reference suffered from the same pipeline (44.40% Q6_K → 89.60% BF16, +45 pp from the methodology fix alone), so the relative Δ between 128e and 98e was inflated by infrastructure noise on top of the actual pruning cost.
  • BF16 + vLLM + chat-completions + the humaneval_instruct_chat stop-sequence override + fence-strip rescore eliminates all of those at once.

The clean reading is what's in the table above: 23.4% expert reduction costs ~4–5% relative on code generation, zero on knowledge recall.

For the original llama.cpp Gemma 4 issues that motivated the BF16 rerun, see the supplementary Q6_K snapshot below.

Architecture

Unchanged from the original except num_experts: 98 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8 (of 98 available)
  • Attention: Hybrid sliding (5) + global (1) pattern, head_dim=512
  • Vocabulary: 262,144
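
A back-of-the-envelope check on the headline parameter count, assuming the standard three projection matrices (gate/up/down) per expert:

hidden, d_expert, n_layers, dropped = 2816, 704, 30, 30
per_expert = 3 * hidden * d_expert         # gate + up + down: ~5.95M params
removed = dropped * n_layers * per_expert  # ~5.35B params across all MoE layers
print(f"{per_expert/1e6:.2f}M per expert, {removed/1e9:.2f}B removed")
# 26B - ~5.35B ≈ 20.65B, consistent with the ~20.8B headline
# (dropped router rows add only 30 * 30 * 2816 ≈ 2.5M, negligible)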

Files

  • config.json – Model config with num_experts: 98
  • model-0000N-of-00009.safetensors – Model weights (bf16)
  • expert_drop.py – Deterministic expert pruning script

Supplementary llama.cpp Q6_K snapshot (2026-05-11) – tainted, do not cite

These numbers are appended for archival reference only and DO NOT supersede the previously published evaluations above. They were collected with a llama.cpp build whose Gemma 4 support is incomplete; multiple known issues (reasoning-token leaks, <unused> flood, fence drift, cache-reuse gaps) are still active in this serving path. The proper way to read this section is "what numbers does the llama.cpp pipeline produce today", not "what is the true capability of the model".

  • llama.cpp tag/build: b9095-2-g0b04728 (build 590), CUDA backend on RTX 3090.
  • Serving profile: --reasoning-format deepseek --reasoning-budget 8192, chat-completions endpoint, q8_0 KV cache, --parallel 2 (see the invocation sketch after this list).
  • Sampler: greedy for HE/MBPP/LCB (--temp 0 --top-p 1 --top-k 0 --seed 42), gemma4 preset for GPQA (T=1.0 / top_p=0.95 / top_k=64).
  • Scoring: lm-evaluation-harness --apply_chat_template, --use_cache, --log_samples, no fence-strip rescore applied here (chat-completions cleanly returns resps).
  • A vLLM 4-bit cross-validation of these same benchmarks is scheduled; if vLLM agrees, the deltas below are model-level; if not, the deltas are pipeline-level and the canonical (published) numbers stand.
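
For reproducibility, the serving profile above maps roughly onto the following llama-server invocation; the KV-cache and parallel flags are inferred from the bullets, not copied from the original run script:

llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --parallel 2 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 0 --top-p 1 --top-k 0 --seed 42   # greedy profile for HE/MBPP/LCB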

Scores

| Bench (n) | 128e Q6_K | 98e-v3 Q6_K | 98e-v4 (cd-max) Q6_K |
|---|---|---|---|
| HumanEval-chat @3072 (164) | 97.56% | 73.78% | 96.34% |
| MBPP-chat (500) | 79.20% ±1.82 | – (no clean chat run) | 76.00% ±1.91 |
| LCB-medium @8k (55) | 87.27% (48/55) | – | 78.18% (43/55) |
| LCB-medium @16k (55) | in flight | – | 78.18% (43/55) |
| GPQA-Diamond (198) | 65.66% ±3.38 (v4-proto) | 75.25% ±3.07 (legacy proto) | 77.27% ±2.99 |

The 128e GPQA at 65.66% under the v4 protocol is well below the canonical 75.25% Q6_K and is the clearest evidence here that the llama.cpp serving path is the failure mode, not the model. The 75.25% legacy number remains the value to cite for 128e. The 98e-v3 row reuses the published GPQA result (run under the legacy llama.cpp protocol) for completeness; it is not re-measured here.

Output tokens per response (samples-derived)

LCB token counts are real (completion_tokens from the patched runner). HE / MBPP / GPQA counts are approximated as len(resps) / 4 (characters / 4) from samples_*.jsonl; this is within ±15% of true Gemma 4 token counts (see the sketch below).
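
A sketch of that approximation, assuming each samples_*.jsonl row carries the response under resps[0][0] as in lm-eval's --log_samples output:

import json
import numpy as np

def token_stats(samples_path):
    # Approximate per-response token counts as character length / 4
    toks = []
    with open(samples_path) as f:
        for line in f:
            resp = json.loads(line)["resps"][0][0]
            toks.append(len(resp) / 4)  # chars/4 ~ Gemma 4 tokens, within ±15%
    t = np.asarray(toks)
    return {"median": np.median(t), "mean": t.mean(),
            "p95": np.percentile(t, 95), "max": t.max(), "total": t.sum()}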

| Bench | Variant | median | mean | p95 | max | total (n × mean) |
|---|---|---|---|---|---|---|
| HE-chat | 128e | 313 | 334 | 681 | 917 | 54.9k |
| HE-chat | 98e-v3 | 490 | 512 | 953 | 1013 | 84.0k |
| HE-chat | 98e-v4 | 303 | 340 | 755 | 895 | 55.9k |
| MBPP-chat | 128e | 194 | 224 | 453 | 532 | 112k |
| MBPP-chat | 98e-v3 | 129 | 356 | 1328 | 1892 | 178k (raw-protocol outliers) |
| MBPP-chat | 98e-v4 | 165 | 206 | 455 | 530 | 103k |
| LCB-med @8k | 128e | 1174 | 2167 | 7949 | 8192 | 119k |
| LCB-med @8k | 98e-v4 | 2829 | 3667 | 8192 | 8192 | 202k |
| LCB-med @16k | 98e-v4 | 2829 | 4913 | 15983 | 16064 | 270k |
| GPQA-D | 128e | 749 | 948 | 2098 | 4576 | 375k |
| GPQA-D | 98e-v3 | 648 | 655 | 815 | 1076 | 250k |
| GPQA-D | 98e-v4 | 676 | 783 | 950 | 5495 | 305k |

Raising the LCB cap from 8k to 16k for 98e-v4 did not change the score (43/55 both times). The cap was binding for 22/55 problems at 8k but every truncated answer was already on a wrong trajectory; the failures are real, not truncation.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended)

llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5

GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF
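
A quick smoke test against the server above (the prompt and port are just the values used elsewhere on this card):

curl http://127.0.0.1:8099/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}],
        "temperature": 1.0, "top_p": 0.95, "max_tokens": 512
      }'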

Related Models

| Model | Description |
|---|---|
| gemma-4-A4B-109e-v3-it | 109 experts (19 dropped), clean teacher-force map |
| gemma-4-A4B-109e-v3-it-GGUF | GGUF quants for 109e v3 |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for this model |

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix GGUF quantization