Gemma 4 A4B 98-Expert v3 (20.8B)
Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a contribution-weighted importance map aggregated across all task categories (math, logic, code, science, creative).
| | Original (128e) | 109e v3 | This model (98e v3) |
|---|---|---|---|
| Total params | 26B | 22.4B | ~20.8B |
| Experts per layer | 128 | 109 | 98 |
| Experts dropped | – | 19/layer | 30/layer |
| MoE capacity removed | – | 14.8% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| GPQA Diamond (Q6_K) | 75.25% | 71.72% | 75.25% |
Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity). This model matches the full 128-expert reference exactly on GPQA Diamond.
GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF, including standard Bartowski quants plus ContribDynamic (CD) per-layer quants.
Pruning Method
Contribution-Weighted Expert Analysis
The drop map is derived from `expert_neuron_v4.json`, a comprehensive per-expert contribution analysis across all task categories (math, logic, code, science, creative) using 128-token teacher-force analysis on the full 128-expert reference model.
Process (`scripts/expert_drop.py`):
- Contribution scoring: For each expert in each layer, the total contribution (`tc` field) is computed as the sum of weighted output norms across all task categories.
- Per-layer ranking: Experts are ranked by total contribution within each layer.
- Drop decision: The 30 lowest-contributing experts per layer are dropped (128 → 98).
- Router resize: The MoE router `proj.weight` is resized from `[128, hidden]` to `[98, hidden]`, keeping only rows for retained experts. The top-8 routing naturally adapts.
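The steps above can be sketched in a few lines. This is a simplified illustration with synthetic tensors, not the actual `scripts/expert_drop.py`: the per-expert `tc` values are assumed precomputed, and only one layer's router is shown.

```python
import numpy as np

def prune_experts(router_weight, total_contribution, keep=98):
    """Keep the `keep` highest-contributing experts in one MoE layer.

    router_weight: [num_experts, hidden] router projection for this layer.
    total_contribution: [num_experts] summed weighted output norms ("tc").
    Returns the resized router weight and the indices of retained experts.
    """
    # Rank experts by total contribution, highest first.
    order = np.argsort(total_contribution)[::-1]
    kept = np.sort(order[:keep])  # keep top `keep`, preserve original order
    # Router resize: drop rows of removed experts; top-8 routing then
    # operates over the remaining logits unchanged.
    return router_weight[kept], kept

# Toy example: 128 experts, hidden size 16
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 16))
tc = rng.random(128)
new_w, kept = prune_experts(w, tc, keep=98)
print(new_w.shape)  # (98, 16)
```

The same selection and row-slicing would be applied per layer to the expert weight tensors themselves; only the router needs an explicit shape change.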
Why 98e Works Better Than 109e
The 98e v3 model uses a different importance map than 109e v3:
- 109e v3: Uses a per-question top-16 protection scheme on GPQA Diamond specifically (teacher-force analysis, 196 questions). This over-fits the drop map to GPQA.
- 98e v3: Uses `expert_neuron_v4.json`, which aggregates contribution across all task categories with 128-token analysis windows. This produces a more generalizable importance ranking.
The result: 98e v3 matches the 128e reference (75.25%) while dropping 30 experts, whereas 109e v3 drops only 19 experts but loses 3.53 pp. The broader importance map makes better pruning decisions.
Key Findings
- Experts are NOT topic-specialized: Top-32 overlap is 28/32 between math and creative domains. The same experts are important across tasks.
- Contribution is moderately concentrated: Gini coefficient ~0.38. You need ~75 experts per layer for 80% of the contribution; the bottom 30 carry very little signal.
- Expert weight similarity is near zero: Max cosine similarity ~0.05 between expert weight matrices. Merging experts by averaging destroys the model. Expert dropping is the only viable structural compression.
- Early layers matter most: Layer 0 has highest importance (1.0), layers 28-29 are lowest (~0.04-0.05). But the drop is applied uniformly across layers.
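The concentration findings can be sanity-checked with a short script. The contribution values here are synthetic (drawn from a gamma distribution for illustration); the real values come from `expert_neuron_v4.json`:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array (0 = equal, ->1 = concentrated)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def experts_for_share(contrib, share=0.80):
    """How many top experts are needed to cover `share` of total contribution."""
    c = np.sort(contrib)[::-1]
    frac = np.cumsum(c) / c.sum()
    return int(np.searchsorted(frac, share) + 1)

rng = np.random.default_rng(1)
contrib = rng.gamma(shape=2.0, scale=1.0, size=128)  # synthetic per-expert "tc"
print(round(gini(contrib), 2), experts_for_share(contrib, 0.80))
```

A Gini around 0.38 with ~75 experts needed for 80% of the mass is exactly the regime where dropping the bottom 30 is cheap.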
GPQA Diamond Evaluation
Setup (identical methodology to 109e v3)
- Quantization: GGUF Q6_K via llama.cpp `llama-quantize` (imatrix calibration)
- Inference: llama.cpp `llama-server` (OpenAI-compatible API)
- Evaluation: lm-evaluation-harness, task `gpqa_diamond_cot_zeroshot`
- GPU: NVIDIA RTX 3090 (24 GB)
Configuration
| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek |
| Reasoning budget | 8192 tokens |
| Temperature | 1.0 (Gemma 4 official) |
| top_p | 0.95 |
| top_k | 64 |
| DRY multiplier | 0.5 |
| Tokenizer | google/gemma-4-26B-A4B-it (original) |
Results
| Model | Experts/Layer | GPQA Diamond (flex) | Delta vs 128e |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 128 | 75.25% | – |
| gemma-4-A4B-98e-v3-it (this) | 98 | 75.25% | +0.00 pp |
| gemma-4-A4B-109e-v3-it | 109 | 71.72% | -3.53 pp |
| gemma-4-E4B (small Gemma 4) | – | 57.07% | -18.18 pp |
Code Benchmarks (HumanEval + MBPP)
Code-generation benchmarks were run at BF16 via vLLM (TP=4 on 4× RTX 3090) using a chat-completions pipeline that bypasses the llama.cpp Gemma 4 reasoning-token / shared-KV bugs. Both 128e (reference) and 98e v3 (this model) were evaluated with identical methodology for an apples-to-apples comparison.
Setup
- Precision: BF16 (no quantization), eliminating Q6_K confounders
- Inference: vLLM TP=4, `/v1/chat/completions` endpoint, native chat template
- Evaluation: lm-evaluation-harness with a custom `humaneval_instruct_chat` task that overrides the stop sequence to ``["\n```"]`` (closing code fence) so the model's full code body is captured before `\ndef`, `\nclass`, etc. trigger early stop
- MBPP: same vLLM endpoint, default lm-eval `mbpp` task
- Decoding: greedy (`temperature=0.0`, `top_p=1.0`), `max_gen_toks=2048`
- HumanEval rescoring: lm-eval's `build_predictions_instruct` filter strips at the opening code fence (dropping the body); a local rescore (`scripts/rescore_humaneval_strip_fences.py --from-raw`) reads `resps[0][0]`, strips both opening and closing fences, and re-runs `exec(prompt + completion + check)` with a 10 s timeout
- MBPP rescoring: same fence-strip applied to `samples_mbpp.jsonl`, since chat mode also wraps MBPP responses in `` ```python `` blocks
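To illustrate the fence-strip step, here is a minimal helper in the spirit of `scripts/rescore_humaneval_strip_fences.py`. It is a simplification: the real script also re-executes `exec(prompt + completion + check)` with a timeout, and its exact regexes may differ.

```python
import re

FENCE_OPEN = re.compile(r"^\s*```[a-zA-Z0-9_+-]*\s*\n", re.MULTILINE)
FENCE_CLOSE = re.compile(r"\n```.*$", re.DOTALL)

def strip_fences(completion: str) -> str:
    """Remove an opening ```python fence and everything from the closing
    fence onward, keeping only the code body the model produced."""
    body = FENCE_OPEN.sub("", completion, count=1)
    body = FENCE_CLOSE.sub("", body, count=1)
    return body

raw = "```python\ndef add(a, b):\n    return a + b\n```\nExplanation follows."
print(strip_fences(raw))  # def add(a, b):\n    return a + b
```

Without this step, the markdown fence lands inside the `exec`'d string and every chat-mode completion scores as a `SyntaxError`.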
Results (BF16, fence-strip rescored)
| Model | HumanEval | MBPP | GPQA Diamond |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 76.83% | 89.60% | 75.25% |
| gemma-4-A4B-98e-v3-it (this) | 73.78% | 85.60% | 75.25% |
| Δ vs 128e | -3.05 pp (-4.0% rel) | -4.00 pp (-4.5% rel) | +0.00 pp |
Why these numbers replace the previous Q6_K table
An earlier version of this card reported HumanEval 10.37% / MBPP 20.60% for 98e v3 under Q6_K + llama.cpp + raw `/v1/completions`. Those numbers were artifacts of the eval pipeline, not the model:
- Greedy chat-mode generation under llama.cpp's Gemma 4 path leaks reasoning tokens, hits `<unused>` floods, and produces markdown `` ```python `` fences that lm-eval's `exec(prompt + completion + tests)` scorer treats as `SyntaxError`.
- The 128e reference suffered from the same pipeline (44.40% Q6_K → 89.60% BF16, +45 pp from the methodology fix alone), so the relative Δ between 128e and 98e was inflated by infrastructure noise on top of the actual pruning cost.
- BF16 + vLLM + chat completions + the `humaneval_instruct_chat` stop-sequence override + fence-strip rescore eliminate all of those at once.
The clean reading is what's in the table above: 23.4% expert reduction costs ~4–5% relative on code generation, zero on knowledge recall.
For the original llama.cpp Gemma 4 issues that motivated the BF16 rerun:
- llama.cpp #21321 – Gemma 4 generates `<unused24>` tokens
- llama.cpp #21338 – Can't disable thinking in gemma4-26b-a4b
- llama.cpp #21468 – Cache reuse not supported for Gemma 4
- llama.cpp #21516 – Gemma 4 infinite `<unused>` loop
Architecture
Unchanged from the original except `num_experts: 98` (was 128):
- Layers: 30
- Hidden size: 2816
- Expert intermediate size: 704 per expert
- Dense MLP intermediate size: 2112 (always active)
- Top-k routing: 8 (of 98 available)
- Attention: Hybrid sliding (5) + global (1) pattern, `head_dim=512`
- Vocabulary: 262,144
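As a back-of-the-envelope check, the parameter counts in the header table are consistent with these dimensions, assuming a standard gated MLP (gate/up/down projections) per expert; that three-matrix structure is an assumption here, not stated on this card:

```python
hidden = 2816
expert_inter = 704
layers = 30

# gate, up, down projections per expert
params_per_expert = 3 * hidden * expert_inter
dropped = (128 - 98) * params_per_expert * layers
print(f"params removed: {dropped / 1e9:.2f}B")  # ~5.35B
```

26B minus ~5.35B lands near the ~20.8B quoted above, with the small residual attributable to rounding and non-expert parameters; 30/128 dropped experts also matches the 23.4% "MoE capacity removed" figure.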
Files
- `config.json` – Model config with `num_experts: 98`
- `model-0000N-of-00009.safetensors` – Model weights (bf16)
- `expert_drop.py` – Deterministic expert pruning script
Supplementary llama.cpp Q6_K snapshot (2026-05-11) – tainted, do not cite
These numbers are appended for archival reference only and DO NOT supersede the previously published evaluations above. They were collected with a llama.cpp build whose Gemma 4 support is incomplete; multiple known issues (reasoning-token leaks, `<unused>` flood, fence drift, cache-reuse gaps) are still active in this serving path. The proper way to read this section is "what numbers does the llama.cpp pipeline produce today", not "what is the true capability of the model".
- llama.cpp tag/build: `b9095-2-g0b04728` (build 590), CUDA backend on RTX 3090.
- Serving profile: `--reasoning-format deepseek --reasoning-budget 8192`, chat-completions endpoint, q8_0 KV cache, `--parallel 2`.
- Sampler: greedy for HE/MBPP/LCB (`--temp 0 --top-p 1 --top-k 0 --seed 42`), gemma4 preset for GPQA (T=1.0 / top_p=0.95 / top_k=64).
- Scoring: lm-evaluation-harness with `--apply_chat_template`, `--use_cache`, `--log_samples`; no fence-strip rescore applied here (chat completions cleanly return `resps`).
- A vLLM 4-bit cross-validation of these same benchmarks is scheduled; if vLLM agrees, the deltas below are model-level; if not, they are pipeline-level and the canonical (published) numbers stand.
Scores
| Bench (n) | 128e Q6_K | 98e-v3 Q6_K | 98e-v4 (cd-max) Q6_K |
|---|---|---|---|
| HumanEval-chat @3072 (164) | 97.56% | 73.78% | 96.34% |
| MBPP-chat (500) | 79.20% ±1.82 | – (no clean chat run) | 76.00% ±1.91 |
| LCB-medium @8k (55) | 87.27% (48/55) | – | 78.18% (43/55) |
| LCB-medium @16k (55) | in flight | – | 78.18% (43/55) |
| GPQA-Diamond (198) | 65.66% ±3.38 (v4-proto) | 75.25% ±3.07 (legacy proto) | 77.27% ±2.99 |
The 128e GPQA at 65.66% under the v4 protocol is well below the canonical 75.25% Q6_K and is the clearest evidence here that the llama.cpp serving path is the failure mode, not the model. The 75.25% legacy number remains the value to cite for 128e. The 98e-v3 row reuses the published GPQA result (run under the legacy llama.cpp protocol) for completeness; it is not re-measured here.
Output tokens per response (samples-derived)
LCB token counts are real (`completion_tokens` from the patched runner). HE / MBPP / GPQA counts are approximated as `len(resps) / 4` from `samples_*.jsonl`; this is within ±15% of true Gemma 4 tokens.
| Bench | Variant | median | mean | p95 | max | total (n × mean) |
|---|---|---|---|---|---|---|
| HE-chat | 128e | 313 | 334 | 681 | 917 | 54.9k |
| HE-chat | 98e-v3 | 490 | 512 | 953 | 1013 | 84.0k |
| HE-chat | 98e-v4 | 303 | 340 | 755 | 895 | 55.9k |
| MBPP-chat | 128e | 194 | 224 | 453 | 532 | 112k |
| MBPP-chat | 98e-v3 | 129 | 356 | 1328 | 1892 | 178k (raw-protocol outliers) |
| MBPP-chat | 98e-v4 | 165 | 206 | 455 | 530 | 103k |
| LCB-med @8k | 128e | 1174 | 2167 | 7949 | 8192 | 119k |
| LCB-med @8k | 98e-v4 | 2829 | 3667 | 8192 | 8192 | 202k |
| LCB-med @16k | 98e-v4 | 2829 | 4913 | 15983 | 16064 | 270k |
| GPQA-D | 128e | 749 | 948 | 2098 | 4576 | 375k |
| GPQA-D | 98e-v3 | 648 | 655 | 815 | 1076 | 250k |
| GPQA-D | 98e-v4 | 676 | 783 | 950 | 5495 | 305k |
Raising the LCB cap from 8k to 16k for 98e-v4 did not change the score (43/55 both times). The cap was binding for 22/55 problems at 8k but every truncated answer was already on a wrong trajectory; the failures are real, not truncation.
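The char/4 approximation behind the HE / MBPP / GPQA rows above can be reproduced from the samples files; a sketch on toy in-memory records (the `resps` field layout is assumed from lm-eval's `--log_samples` output):

```python
import json
import statistics

def token_stats(jsonl_lines):
    """Approximate output-token stats from lm-eval samples_*.jsonl lines:
    tokens ~= len(resps[0][0]) / 4 characters per token."""
    lengths = sorted(len(json.loads(l)["resps"][0][0]) / 4 for l in jsonl_lines)
    return {
        "median": statistics.median(lengths),
        "mean": statistics.fmean(lengths),
        "p95": lengths[int(0.95 * (len(lengths) - 1))],
        "max": lengths[-1],
        "total": sum(lengths),
    }

# Toy records standing in for real samples_*.jsonl lines
lines = [json.dumps({"resps": [["x" * n]]}) for n in (400, 800, 1200)]
print(token_stats(lines))
```

On real files, pass `open("samples_humaneval.jsonl")` (or similar) directly as `jsonl_lines`.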
Usage
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```
llama.cpp (recommended)
```shell
llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
  --port 8099 -c 32768 -ngl 99 --no-warmup \
  --reasoning-format deepseek --reasoning-budget 8192 \
  --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5
```
GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF
Related Models
| Model | Description |
|---|---|
| gemma-4-A4B-109e-v3-it | 109 experts (19 dropped), clean teacher-force map |
| gemma-4-A4B-109e-v3-it-GGUF | GGUF quants for 109e v3 |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for this model |
License
This model inherits the Gemma license from the base model.
Acknowledgements
- Google for the base Gemma 4 26B-A4B-it model
- The GPQA Diamond benchmark (Rein et al., 2023)
- bartowski for the calibration data v5 used in imatrix GGUF quantization