TheStageAI/gemma-4-E2B-it

A compressed, edge-ready variant of Google's Gemma 4 E2B (instruction-tuned), packaged for MLX on Apple Silicon Macs and iPhones. The checkpoint is roughly 7× smaller than the original (fits in ~1.4 GB) while preserving the capabilities that matter most for on-device assistants: general world knowledge, instruction following, and tool use.

Why this exists

Gemma 4 E2B is a "2B" model by effective parameter count, but the dense checkpoint is closer to 5.1B parameters once Per-Layer Embeddings (PLE) are counted, and in BF16 the PLE table alone is ~4.7 GB. On mobile hardware, three things block deployment: download size, runtime memory footprint (iOS enforces a ~3 GB per-app budget), and generation speed. We compress the model along its natural structure to address all three at once.
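
The arithmetic behind these figures can be checked in a few lines. This is a back-of-the-envelope sketch, assuming 2 bytes per BF16 parameter, decimal gigabytes (1 GB = 10^9 bytes), and compression ratios measured against the full BF16 checkpoint. The helper names are illustrative and not part of edge-lm.

```python
BYTES_PER_BF16 = 2   # BF16 stores each parameter in 16 bits
GB = 1e9             # decimal gigabytes (assumption)


def bf16_size_gb(num_params: float) -> float:
    """Size in GB of num_params parameters stored in BF16."""
    return num_params * BYTES_PER_BF16 / GB


def params_from_bf16_gb(size_gb: float) -> float:
    """Parameter count implied by a BF16 footprint given in GB."""
    return size_gb * GB / BYTES_PER_BF16


def compressed_size_gb(dense_gb: float, ratio: float) -> float:
    """Artifact size after compressing dense_gb by the given ratio."""
    return dense_gb / ratio


dense_gb = bf16_size_gb(5.1e9)          # full checkpoint, PLE included
ple_params = params_from_bf16_gb(4.7)   # parameters held in the PLE tables
ios_budget_gb = 3.0                     # approximate per-app budget quoted above
at_7x_gb = compressed_size_gb(dense_gb, 7)

print(f"dense BF16 checkpoint: {dense_gb:.1f} GB")
print(f"PLE parameters: {ple_params / 1e9:.2f}B")
print(f"dense checkpoint fits the budget: {dense_gb < ios_budget_gb}")
print(f"size at 7x compression: {at_7x_gb:.2f} GB")
```

Under these assumptions the PLE tables account for nearly half of the dense parameters, which is why they get their own codec rather than sharing the scheme used for the transformer blocks.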

How it was compressed

  • Transformer blocks: GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as flat, MLX-compatible per-group weight-only tensors.
  • PLE tables: an AQLM-style vector-quantization codec with sensitivity-weighted (Fisher-style) assignments. The 4.7 GB BF16 PLE compresses to **0.26 GB** (indices + codebooks), decompressed on the fly with a single batched gather across all layers.
  • Token embeddings / LM head: flat per-group scalar quantization matched to the same runtime contract.
  • Bit-width schedule: chosen per module by Riemannian Constrained Optimization (RCO) under an exact byte budget; the release checkpoint is re-quantized from the dense model in one consistent GPTQ/QEP pass.
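
Two of the building blocks named above, per-group scalar quantization and codebook decoding by gather, can be illustrated with NumPy. This is a minimal sketch and not the release pipeline: it uses plain round-to-nearest instead of GPTQ/QEP, a generic additive-codebook lookup instead of the actual PLE codec, and the bit width, group size, and codebook shapes are assumptions chosen for illustration.

```python
import numpy as np


def quantize_per_group(w, bits=4, group_size=32):
    """Round-to-nearest quantization with one scale and offset per group."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum(hi - lo, 1e-8) / levels   # guard against constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


def dequantize_per_group(q, scale, lo, shape):
    """Reconstruct approximate weights from integer codes."""
    return (q * scale + lo).reshape(shape)


def vq_decode(codes, codebooks):
    """Additive vector-quantization decode.

    codes:     (num_vectors, num_codebooks) integer indices
    codebooks: (num_codebooks, codebook_size, dim) float codewords
    Each output vector is the sum of one codeword from every codebook.
    """
    book_idx = np.arange(codebooks.shape[0])
    # One fancy-indexing gather fetches codebooks[m, codes[n, m]] for all n, m.
    gathered = codebooks[book_idx[None, :], codes]   # (num_vectors, num_codebooks, dim)
    return gathered.sum(axis=1)


rng = np.random.default_rng(0)

# Per-group scalar quantization round trip.
w = rng.standard_normal((8, 64)).astype(np.float32)
q, scale, lo = quantize_per_group(w)
w_hat = dequantize_per_group(q, scale, lo, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max reconstruction error: {max_err:.4f}")

# Codebook decode for a handful of vectors.
codebooks = rng.standard_normal((2, 256, 8)).astype(np.float32)
codes = rng.integers(0, 256, size=(5, 2))
vectors = vq_decode(codes, codebooks)
print(vectors.shape)
```

For scale, 0.26 GB against 4.7 GB of BF16 works out to roughly 0.9 bits per original PLE parameter (16 × 0.26 / 4.7), a rate that scalar quantization cannot reach and that motivates a vector codec for these tables.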

Operating points

This repo ships two release operating points, selected via the size argument:

| size | Trade-off | Compression |
|------|-----------|-------------|
| l | More quality, larger artifact | 5.62× |
| m | Smaller headline target (default) | 6.40× |

It also includes optional 4-bit vision and audio towers for image understanding and audio transcription.

Usage

Install from source:

git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm

Text generation:

from edge_lm import load
from mlx_vlm import stream_generate

model, tokenizer = load("TheStageAI/gemma-4-E2B-it", size="m")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity in one sentence."}],
    tokenize=False, add_generation_prompt=True,
)
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)
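
Since generation speed is one of the deployment constraints, it can be useful to time the stream while collecting it. The helper below is a hypothetical sketch, not part of edge-lm or mlx_vlm: it only assumes that each chunk exposes a `text` attribute, as in the loop above, and it uses a stand-in stream so that it runs without a model.

```python
import time
from types import SimpleNamespace


def collect_with_rate(chunks):
    """Consume a chunk stream and return (full_text, chunks_per_second)."""
    start = time.perf_counter()
    pieces = []
    for chunk in chunks:
        pieces.append(chunk.text)
    elapsed = time.perf_counter() - start
    rate = len(pieces) / elapsed if elapsed > 0 else float("inf")
    return "".join(pieces), rate


# Stand-in stream so the sketch runs without loading a checkpoint.
fake_stream = (
    SimpleNamespace(text=t)
    for t in ["Gravity ", "pulls ", "masses ", "together."]
)
text, rate = collect_with_rate(fake_stream)
print(text)
print(f"{rate:.0f} chunks/s")
```

With a loaded model, pass `stream_generate(model, tokenizer, prompt, max_tokens=128)` in place of `fake_stream`. The rate approximates tokens per second only if each chunk carries a single token, which is an assumption worth verifying on the target device.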

Vision and audio (loads the optional towers):

model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_vision=True)   # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_audio=True)    # audio transcription

Only the files needed for the requested size are downloaded.

Benchmarks

Every model, ours and the GGUF baselines alike, is dequantized to a standard BF16 checkpoint and served through vLLM, so the backend is equalized. We report MMLU-Pro (general knowledge), IFEval (instruction following), and τ²-Bench / Tau2 (multi-step tool use). For Tau2 the Gemma checkpoint acts as the agent while a fixed Qwen3-235B-A22B-2507 simulates the user.

| Model | Compression | MMLU-Pro | IFEval | Tau2 (avg of 3) |
|-------|-------------|----------|--------|-----------------|
| BF16 (reference) | 1.00× | 61.85 | 74.68 | 30.67 |
| Ours L | 5.62× | **54.48** | **74.86** | 22.20 |
| Ours M | 6.40× | 49.85 | 71.53 | **23.45** |
| Unsloth Q3-K-S | 3.81× | 48.20 | 64.51 | 18.69 |
| Unsloth UD-Q2-K-XL | 3.87× | 43.17 | 66.54 | 20.23 |

Bold marks the best result among the compressed checkpoints in each benchmark column.
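
One way to read the table is as the share of the BF16 reference score that each compressed checkpoint retains. The snippet below is a small illustrative calculation using only the scores listed above; the `retention` helper is a name introduced here, not part of any evaluation harness.

```python
# Scores copied from the benchmark table above.
reference = {"MMLU-Pro": 61.85, "IFEval": 74.68, "Tau2": 30.67}
compressed = {
    "Ours L": {"MMLU-Pro": 54.48, "IFEval": 74.86, "Tau2": 22.20},
    "Ours M": {"MMLU-Pro": 49.85, "IFEval": 71.53, "Tau2": 23.45},
    "Unsloth Q3-K-S": {"MMLU-Pro": 48.20, "IFEval": 64.51, "Tau2": 18.69},
    "Unsloth UD-Q2-K-XL": {"MMLU-Pro": 43.17, "IFEval": 66.54, "Tau2": 20.23},
}


def retention(scores, ref):
    """Percentage of the reference score retained on each benchmark."""
    return {name: 100.0 * scores[name] / ref[name] for name in ref}


for model, scores in compressed.items():
    kept = retention(scores, reference)
    summary = ", ".join(f"{name} {pct:.1f}%" for name, pct in kept.items())
    print(f"{model}: {summary}")
```

Retention is only a summary statistic: differences of a point or two, particularly on Tau2, may fall within run-to-run variation.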

Files

| File | Contents |
|------|----------|
| config.json | Shared model config (architecture) |
| model_{s,m,l}.safetensors | Quantized decoder weights per operating point (quantization map in metadata) |
| ple_{s,m,l}.safetensors | Compact AQLM PLE codes + codebooks |
| vision_tower.safetensors | Optional 4-bit vision tower |
| audio_tower.safetensors | Optional 4-bit audio tower |
| tokenizer.json, tokenizer_config.json | Tokenizer |

License

Released under the MIT License, © 2025 thestage.ai labs. As a derivative of Google's Gemma 4, the weights are additionally subject to the Gemma Terms of Use.

Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO); see the references in the edge-lm write-up.
