gemma-4-31B-it-fp8

FP8 quantized version of google/gemma-4-31B-it (31B params, server model). Produced and maintained by vrfai.

Quantization Details

This model was quantized using NVIDIA ModelOpt with the following configurations:

Property Value
Base model google/gemma-4-31B-it
Quant method NVIDIA ModelOpt (FP8 E4M3 - num_bits: (4, 3))
Weight scheme Per-channel (axis: 0)
Input activation Dynamic Per-token (type: dynamic)
Calibration algorithm max

Usage

You can deploy this model using vLLM with the modelopt quantization backend. Please ensure you refer to the vLLM documentation for Gemma 4 for advanced serving options.

vllm serve vrfai/gemma-4-31B-it-fp8
  --quantization modelopt \
  --max-model-len 32768 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --async-scheduling \
  --trust-remote-code
Downloads last month
355
Safetensors
Model size
31B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including vrfai/gemma-4-31B-it-fp8