Instructions to use amd/Kimi-K2.5-Eagle3-FP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use amd/Kimi-K2.5-Eagle3-FP8 with Transformers:
# Load model directly from transformers import AutoTokenizer, LlamaForCausalLMEagle3 tokenizer = AutoTokenizer.from_pretrained("amd/Kimi-K2.5-Eagle3-FP8") model = LlamaForCausalLMEagle3.from_pretrained("amd/Kimi-K2.5-Eagle3-FP8") - Notebooks
- Google Colab
- Kaggle
Model Overview
kimi-k2.5-eagle3-fp8 is an FP8-quantized version of lightseekorg/kimi-k2.5-eagle3, an Eagle3 MTP draft model for accelerating inference of Kimi-K2.5 with speculative decoding.
This checkpoint was quantized with AMD Quark. The quantized tensors use FP8 quantization metadata in the model config. The LM head is not quantized and was intentionally excluded from quantization.
Model Quantization
The checkpoint keeps the original Eagle3 architecture and exports Quark quantization metadata in config.json. The fc projection and lm_head are intentionally not quantized.
Quantization details:
- Quantization tool: AMD Quark
- Quantization method:
quark - Quantization scheme:
ptpc_fp8 - FP8 format:
fp8_e4m3 - Weight quantization: FP8 E4M3, static, per-channel, symmetric, channel axis
0 - Input/activation quantization config: FP8 E4M3, dynamic, per-channel, symmetric, channel axis
1 - Export weight format:
real_quantized - Output tensor quantization: not enabled
- KV-cache quantization: not enabled
- Excluded from quantization:
fc,lm_head
Quantization Command
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py \
--model_dir lightseekorg/kimi-k2.5-eagle3 \
--quant_scheme ptpc_fp8 \
--exclude_layers fc lm_head \
--output_dir amd/kimi-k2.5-eagle3-fp8 \
--file2file_quantization
No calibration dataset is required for this file-to-file quantization path.
Quantization Environment
Quantization was run on a single AMD GPU (ROCm) using the following software stack:
| Component | Version |
|---|---|
| OS | Linux (x86_64) |
| Python | 3.11 |
| PyTorch | 2.9.0 (ROCm 6.4 build) |
pytorch-triton-rocm |
3.5.0 |
| Transformers | 4.57.1 |
huggingface_hub |
0.36.0 |
accelerate |
1.11.0 |
safetensors |
0.6.2 |
datasets |
3.6.0 |
vLLM Loading Note
When using this FP8 Eagle3 checkpoint as a vLLM draft model, make sure the exported config.json records the excluded layers as regex patterns. If Quark exports:
"exclude": [
"fc",
"lm_head"
]
change it to:
"exclude": [
"re:.*fc.*",
"re:.*lm_head.*"
]
This keeps fc and lm_head unquantized while allowing vLLM to correctly load the Quark FP8 Eagle3 draft model.
Quantized Layers
The following Eagle3 projection weights are stored as F8_E4M3 with associated F32 per-channel scale tensors:
midlayer.self_attn.q_proj.weightmidlayer.self_attn.k_proj.weightmidlayer.self_attn.v_proj.weightmidlayer.self_attn.o_proj.weightmidlayer.mlp.gate_proj.weightmidlayer.mlp.up_proj.weightmidlayer.mlp.down_proj.weight
Each quantized weight tensor has a matching *_weight_scale tensor stored in FP32.
Layers Not Quantized
The following tensors are intentionally not stored as FP8:
fc.weight: kept inF16lm_head.weight: kept inF16embed_tokens.weight: kept inBF16- normalization weights: kept in
F16
Tensor Dtype Overview
| Tensor dtype | Count | Notes |
|---|---|---|
F8_E4M3 |
7 | Quantized attention and MLP projection weights |
F32 |
7 | Per-channel scale tensors for FP8 weights |
F16 |
6 | Excluded fc, lm_head, and normalization weights |
BF16 |
1 | Token embedding weight |
Intended Use
This model is intended to be used as an Eagle3 draft model for speculative decoding with moonshotai/Kimi-K2.5 as the target model.
Because this is an AMD Quark FP8 checkpoint, make sure your inference runtime supports the quantization format and Eagle3 speculative decoding before deployment. Please validate quality and acceptance length in your own serving stack.
Reproduction
The throughput numbers in Results were produced with
vLLM on a single AMD Instinct MI355X node, using
amd/kimi-k2.5-eagle3-fp8 as the EAGLE3 draft model for an MXFP4 target. Three models are
involved:
- Target model:
amd/Kimi-K2.5-MXFP4 - BF16 draft model:
lightseekorg/kimi-k2.5-eagle3 - FP8 draft model:
amd/kimi-k2.5-eagle3-fp8(this model), quantized with AMD Quark FP8 metadata and sharing the BF16 target LM head.
In this setup, the FP8 draft path dispatches through vLLM RowWiseTorchFP8ScaledMMLinearKernel
— i.e. torch._scaled_mm over hipBLASLt row-wise scaled FP8 GEMM — rather than the AITER
preshuffled FP8 path. The target MXFP4 model uses the ROCm FP4 ASM path via
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=1.
Docker Images
| Draft | Docker image | --max-model-len |
|---|---|---|
BF16 (lightseekorg/kimi-k2.5-eagle3) |
vllm/vllm-openai-rocm:v0.19.0 |
2248 |
FP8 (amd/kimi-k2.5-eagle3-fp8) |
vllm/vllm-openai-rocm:nightly-fb1ac806c55a6dc96fe92261b80c8550e9c39d2f |
2304 |
Serving (FP8 Eagle3 draft)
Launch the FP8 container (standard ROCm device mounts), then start the vLLM server. This example matches the TP=4, ISL/OSL = 1K/1K sweep:
docker run -it --rm \
--device /dev/kfd --device /dev/dri --group-add video \
--ipc host --shm-size 16g --network host \
vllm/vllm-openai-rocm:nightly-fb1ac806c55a6dc96fe92261b80c8550e9c39d2f \
bash
# Inside the container:
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=1 # target MXFP4 FP4 ASM GEMM
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export VLLM_ROCM_USE_AITER_RMSNORM=0 # required for TP < 8
vllm serve amd/Kimi-K2.5-MXFP4 \
--port 8888 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--max-model-len 2304 \
--no-enable-prefix-caching \
--trust-remote-code \
--mm-encoder-tp-mode data \
--speculative-config '{"model":"amd/kimi-k2.5-eagle3-fp8","method":"eagle3","num_speculative_tokens":6,"draft_tensor_parallel_size":1}'
- BF16 draft baseline: use image
vllm/vllm-openai-rocm:v0.19.0,--max-model-len 2248, and"model":"lightseekorg/kimi-k2.5-eagle3"in the speculative config. - No-spec baseline: drop
--speculative-configentirely.
Benchmarking
With the server up, run a vllm bench serve throughput sweep over concurrency
C ∈ {4, 8, 16, 32, 64} (10 prompts per concurrency; input/output lengths are sampled around
the 1K target with --random-range-ratio 0.8, and --ignore-eos forces each request to emit
its full sampled output length):
for C in 4 8 16 32 64; do
vllm bench serve \
--model amd/Kimi-K2.5-MXFP4 \
--backend vllm \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--random-range-ratio 0.8 \
--num-prompts $((C * 10)) \
--max-concurrency "$C" \
--request-rate inf \
--ignore-eos \
--use-chat-template \
--trust-remote-code
done
Throughput is reported as decode tokens/s per GPU (total throughput divided by the 4 GPUs); speedups in parentheses are relative to the no-spec baseline at the same concurrency.
Results
Kimi K2.5 Eagle3: BF16 and AMD Quark FP8 Drafts — amd/Kimi-K2.5-MXFP4 target,
ISL/OSL = 1K/1K, TP=4 on a single AMD Instinct MI355X node.
| Concurrency | No-spec (tok/s/GPU) | BF16 Eagle3 (tok/s/GPU) | FP8 Eagle3 (tok/s/GPU) |
|---|---|---|---|
| 4 | 82.7 | 157.0 (1.90x) | 165.2 (2.00x) |
| 8 | 142.2 | 269.1 (1.89x) | 270.1 (1.90x) |
| 16 | 220.5 | 399.6 (1.81x) | 412.7 (1.87x) |
| 32 | 342.2 | 627.6 (1.83x) | 633.8 (1.85x) |
| 64 | 533.3 | 901.6 (1.69x) | 936.6 (1.76x) |
Across all tested concurrencies, the AMD Quark FP8 Eagle3 draft matches or exceeds the BF16 draft throughput, reaching up to 2.00x over the no-spec baseline.
Citation and Acknowledgements
This model is derived from lightseekorg/kimi-k2.5-eagle3. Please refer to the source model card for the original training details, benchmarks, and acknowledgements.
License
Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.
- Downloads last month
- 332