# gemma-4-31B-it-speculator.eagle3-NVFP4 (NVFP4)
NVFP4 quantization of RedHatAI/gemma-4-31B-it-speculator.eagle3 — RedHat / Neural Magic's official EAGLE-3 speculator (drafter) for Gemma 4 31B (dense).
## What this is
Drop-in replacement for the BF16 drafter: 3× smaller (4.5 GB → ~1.5 GB) and ~1.5× faster per draft step on Blackwell with native FP4 tensor cores. Targets the same verifier model as the BF16 source.
## Use it with vLLM
```shell
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --tensor-parallel-size 1 \
  --speculative-config '{
    "model": "AEON-7/gemma-4-31B-it-speculator.eagle3-NVFP4",
    "num_speculative_tokens": 3,
    "method": "eagle3"
  }' \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --enable-prefix-caching
```
The verifier can be any Gemma 4 31B (dense) variant — base, instruct, NVFP4, or fine-tunes (abliterated, domain-tuned, etc.). EAGLE drafters are architecture-bound, not weights-bound; the output distribution is provably the verifier's.
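The `--speculative-config` value must be valid JSON or vLLM will fail at startup. A quick sanity check before launching (the dict below just mirrors the config from the command above):

```python
import json

# Same JSON string as passed to --speculative-config in the serve command.
spec_cfg = json.loads("""{
  "model": "AEON-7/gemma-4-31B-it-speculator.eagle3-NVFP4",
  "num_speculative_tokens": 3,
  "method": "eagle3"
}""")

assert spec_cfg["method"] == "eagle3"
print(spec_cfg["num_speculative_tokens"])  # → 3
```

`num_speculative_tokens` is the lever to tune per workload: raising it helps on high-acceptance tasks (code, math) and wastes draft compute on low-acceptance ones (summarization, QA).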
## Quantization recipe
| Field | Value |
|---|---|
| Algorithm | NVIDIA ModelOpt NVFP4_DEFAULT_CFG (max calibration, no AWQ) |
| Block size | 16 (NVFP4 standard) |
| Excluded from quantization | lm_head, embed_tokens, d2t (vocab map) |
| Calibration data | 256 conversations from HuggingFaceH4/ultrachat_200k (train_sft) |
| Calibration mode | Realistic — ran the target NVFP4 verifier first, captured aux hidden states at the layers in `eagle_aux_hidden_state_layer_ids`, fed them to the drafter alongside `input_ids` |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Output dtype | NVFP4 (FP4 E2M1 + per-block FP8 scales + per-tensor FP32 scales) |
| Modelopt version | 0.43.0rc2.dev (main, with merged PRs #1264 + #1265) |
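The block layout in the table above implies the size ratio claimed earlier. Each weight stores 4 bits (FP4 E2M1) plus one FP8 scale shared across a 16-element block; the per-tensor FP32 scale is negligible for large tensors. A back-of-the-envelope check:

```python
# Effective bits per weight under NVFP4 with block size 16.
block_size = 16
bits_per_weight = 4 + 8 / block_size   # 4-bit value + amortized FP8 block scale

# BF16 stores 16 bits per weight, so the compression ratio is:
ratio_vs_bf16 = 16 / bits_per_weight

print(bits_per_weight)            # → 4.5
print(round(ratio_vs_bf16, 2))    # → 3.56
```

~3.56× on the quantized tensors, landing near the observed ~3× overall once the excluded layers (`lm_head`, `embed_tokens`, `d2t`) are kept at full precision.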
## Performance expectations
Acceptance lengths (relative to the BF16 source — RedHat's published numbers):
| Dataset | BF16 (k=5) | NVFP4 (estimate, k=5) |
|---|---|---|
| HumanEval | 3.80 | ~3.40 |
| math_reasoning | 3.93 | ~3.50 |
| qa | 2.38 | ~2.20 |
| MT-bench | 2.83 | ~2.60 |
| RAG | 2.80 | ~2.60 |
| summarization | 2.20 | ~2.05 |
| translation | 2.68 | ~2.45 |
Roughly 8-12% acceptance loss vs BF16, more than offset by per-step speedup on Blackwell native FP4 hardware.
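To see how acceptance length trades off against draft cost, a first-order model helps (this is illustrative accounting, not a benchmark from this card; it ignores scheduling, batching, and memory effects). Per verifier step, the system accepts `tau` tokens on average and pays for one verify forward plus `k` draft forwards, each costing some fraction of a verify forward:

```python
def spec_decode_speedup(tau, k, draft_cost_ratio):
    """Idealized single-stream speedup of speculative decoding.

    tau              -- mean accepted tokens per verifier step
    k                -- speculative tokens drafted per step
    draft_cost_ratio -- cost of one draft forward relative to one verify forward
    """
    return tau / (1.0 + k * draft_cost_ratio)

# MT-bench estimate from the table above, k=3 as in the serve command,
# and an *assumed* ~5% per-token draft cost for the FP4 drafter:
print(round(spec_decode_speedup(2.60, 3, 0.05), 2))  # → 2.26
```

With those assumptions the model lands inside the 2.0–2.5× range reported below for DGX Spark, and shows why a cheaper (FP4) drafter can net a win despite lower acceptance: shrinking `draft_cost_ratio` recovers more than the ~10% lost in `tau`.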
## Spark / DGX Spark deployment
Tested on NVIDIA DGX Spark (GB10, sm 12.1, 128 GB unified memory) using the
`ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest` image (eugr nightly with
sm_120-compiled FlashInfer CUTLASS + VLLM_CUTLASS NVFP4 kernels).
Single-stream wall-clock: 2.0–2.5× speedup over no speculative decoding on chat workloads.
## Files
- `model.safetensors` — NVFP4 quantized drafter weights (~1.5 GB)
- `config.json` — Eagle3 speculator config (carries verifier reference)
- `config.py` — `CustomEagle3SpeculatorConfig` class (custom code, required for `trust_remote_code=True`)
- `tokenizer.json`, `tokenizer_config.json` — Verifier tokenizer (Gemma 4)
- `hf_quant_config.json` — ModelOpt NVFP4 quantization metadata
- `modelopt_state.pt` — Full modelopt state for re-export
## License
Apache 2.0 (matches base model). NVFP4 quantization is a derivative work contributed under the same terms.
## Provenance
Created by quantizing RedHatAI/gemma-4-31B-it-speculator.eagle3 with NVFP4_DEFAULT_CFG calibrated against
RedHatAI/gemma-4-31B-it-NVFP4 outputs. Methodology adapted from RedHat AI's published
Gemma 4 NVFP4 target recipe + standard EAGLE-3 calibration practice.
## See also
- BF16 source: RedHatAI/gemma-4-31B-it-speculator.eagle3
- Verifier (NVFP4): RedHatAI/gemma-4-31B-it-NVFP4
- EAGLE-3 paper: arXiv:2503.01840
- Speculators library: vllm-project/speculators
## Support the work
If this release has been useful, tips are deeply appreciated β they go directly toward more compute, more models, and more open releases.
- **Bitcoin (BTC):** `bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4`
- **Ethereum (ETH):** `0x1512667F6D61454ad531d2E45C0a5d1fd82D0500`
- **Solana (SOL):** `DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t`
- **Monero (XMR):** `836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd`

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.