gemma-4-31B-it-speculator.eagle3-NVFP4 (NVFP4)

NVFP4 quantization of RedHatAI/gemma-4-31B-it-speculator.eagle3, RedHat / Neural Magic's official EAGLE-3 speculator (drafter) for Gemma 4 31B (dense).

What this is

Drop-in replacement for the BF16 drafter: 3× smaller (4.5 GB → ~1.5 GB) and ~1.5× faster per draft step on Blackwell with native FP4 tensor cores. Targets the same verifier model as the BF16 source.

Use it with vLLM

vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --tensor-parallel-size 1 \
  --speculative-config '{
    "model": "AEON-7/gemma-4-31B-it-speculator.eagle3-NVFP4",
    "num_speculative_tokens": 3,
    "method": "eagle3"
  }' \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --enable-prefix-caching

Verifier can be any Gemma 4 31B (dense) variant: base, instruct, NVFP4, or fine-tunes (abliterated, domain-tuned, etc.). EAGLE drafters are architecture-bound, not weights-bound; the output distribution is provably the verifier's.
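Once the server is running, speculative decoding is transparent to clients: requests go through the standard OpenAI-compatible endpoint and name the verifier, not the drafter. A minimal sketch (host/port are the vLLM defaults; the prompt is illustrative):

```python
import json
import urllib.request

# Chat payload: identical to a non-speculative request. The drafter is
# configured server-side via --speculative-config and never appears here.
payload = {
    "model": "RedHatAI/gemma-4-31B-it-NVFP4",  # the verifier, not the drafter
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}

def chat(base_url: str = "http://localhost:8000") -> dict:
    """POST to a running vLLM server (requires the serve command above)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client library works the same way; nothing in the request changes when the drafter is swapped from BF16 to NVFP4.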

Quantization recipe

Field                       Value
Algorithm                   NVIDIA ModelOpt NVFP4_DEFAULT_CFG (max calibration, no AWQ)
Block size                  16 (NVFP4 standard)
Excluded from quantization  lm_head, embed_tokens, d2t (vocab map)
Calibration data            256 conversations from HuggingFaceH4/ultrachat_200k (train_sft)
Calibration mode            Realistic: ran the target NVFP4 verifier first, captured aux hidden states at the layers in eagle_aux_hidden_state_layer_ids, fed them to the drafter alongside input_ids
Hardware                    1× NVIDIA RTX PRO 6000 Blackwell (96 GB)
Output dtype                NVFP4 (FP4 E2M1 + per-block FP8 scales + per-tensor FP32 scales)
ModelOpt version            0.43.0rc2.dev (main, with merged PRs #1264 + #1265)
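The ~3× size reduction falls out of NVFP4's storage layout. A back-of-envelope sketch (block size and scale widths taken from the table above; ignoring the unquantized lm_head/embed_tokens is why the whole-checkpoint ratio lands nearer 3× than the per-tensor figure):

```python
BLOCK = 16          # NVFP4 block size
FP4_BITS = 4        # one E2M1 element per weight
FP8_SCALE_BITS = 8  # one FP8 scale shared by each block of 16 weights

# Effective bits per quantized weight: element plus amortized block scale.
# The per-tensor FP32 scale is negligible at whole-tensor granularity.
bits_per_weight = FP4_BITS + FP8_SCALE_BITS / BLOCK
print(bits_per_weight)  # 4.5

# Per-tensor compression vs BF16 (16 bits/weight).
compression = 16 / bits_per_weight
print(round(compression, 2))  # 3.56
```

The excluded tensors (lm_head, embed_tokens, d2t) stay at full width, which pulls the overall checkpoint from the ~3.6× per-tensor ratio down to the observed 4.5 GB → ~1.5 GB.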

Performance expectations

Acceptance lengths (relative to the BF16 source; BF16 numbers are RedHat's published figures):

Dataset         BF16 (k=5)   NVFP4 (estimate, k=5)
HumanEval       3.80         ~3.40
math_reasoning  3.93         ~3.50
qa              2.38         ~2.20
MT-bench        2.83         ~2.60
RAG             2.80         ~2.60
summarization   2.20         ~2.05
translation     2.68         ~2.45

Roughly 8-12% acceptance loss vs BF16, more than offset by per-step speedup on Blackwell native FP4 hardware.
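To see how acceptance length translates into wall-clock gain, the standard speculative-decoding cost model is a useful approximation. A sketch (the drafter-to-verifier cost ratio below is an assumed illustrative value, not a measurement of this model):

```python
def speedup(accept_len: float, k: int, draft_cost: float) -> float:
    """Approximate wall-clock speedup of speculative decoding.

    Each verification pass emits roughly accept_len tokens and pays
    for one verifier step (cost 1) plus k draft steps, each costing
    draft_cost relative to a verifier step.
    """
    return accept_len / (1 + k * draft_cost)

# MT-bench-like acceptance (~2.6), k=3 drafts as in the serve command,
# drafter assumed at ~5% of verifier cost per step.
print(round(speedup(2.6, 3, 0.05), 2))  # 2.26
```

Under this model, a small drop in acceptance length costs only a few percent of throughput, while the cheaper FP4 draft steps lower the per-pass overhead, which is why the net effect on Blackwell is positive.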

Spark / DGX Spark deployment

Tested on NVIDIA DGX Spark (GB10, sm 12.1, 128 GB unified memory) using the ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest image (eugr nightly with sm_120-compiled FlashInfer CUTLASS + VLLM_CUTLASS NVFP4 kernels).

Single-stream wall-clock: 2.0-2.5× speedup over no speculative decoding on chat workloads.

Files

  • model.safetensors β€” NVFP4 quantized drafter weights (~1.5 GB)
  • config.json β€” Eagle3 speculator config (carries verifier reference)
  • config.py β€” Custom Eagle3SpeculatorConfig class (custom_code, required for trust_remote_code=True)
  • tokenizer.json, tokenizer_config.json β€” Verifier tokenizer (Gemma 4)
  • hf_quant_config.json β€” ModelOpt NVFP4 quantization metadata
  • modelopt_state.pt β€” Full modelopt state for re-export

License

Apache 2.0 (matches base model). NVFP4 quantization is a derivative work contributed under the same terms.

Provenance

Created by quantizing RedHatAI/gemma-4-31B-it-speculator.eagle3 with NVFP4_DEFAULT_CFG calibrated against RedHatAI/gemma-4-31B-it-NVFP4 outputs. Methodology adapted from RedHat AI's published Gemma 4 NVFP4 target recipe + standard EAGLE-3 calibration practice.

☕ Support the work

If this release has been useful, tips are deeply appreciated; they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)
bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4

Ξ Ethereum (ETH)
0x1512667F6D61454ad531d2E45C0a5d1fd82D0500

◎ Solana (SOL)
DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t

ⓜ Monero (XMR)
836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
