Kimi-K2.6-NVFP4

NVFP4 post-training quantization of moonshotai/Kimi-K2.6 for serving on NVIDIA Blackwell hardware with vLLM.

Third-party model. Weights are derived from moonshotai/Kimi-K2.6 (Moonshot AI). Wafer AI is not affiliated with Moonshot AI; this checkpoint is a community quantization. Use of this model is governed by the upstream Modified MIT License.

At a glance

Architecture: MoE (DeepSeek-V3 style) with Kimi vision tower; ~1T total parameters
Weight format: NVFP4 (FP4 E2M1, 16-element block scale in FP8 E4M3, FP32 per-tensor scale; all calibrated)
Activation format: NVFP4 (calibrated FP32 per-tensor scale; FP8 E4M3 per-16-element block scale computed at inference)
KV cache: FP8 E4M3
Excluded from quant (kept BF16): lm_head, all *self_attn* projections (MLA), *vision_tower* (covers MoonViT), *mm_projector*; set in quantization_config.ignore. MoE routers (mlp.gate.weight) are also BF16: modelopt leaves them as scoring layers despite not being in the explicit ignore list.
Storage: ~590 GB (550 GiB) across 119 safetensors shards
License: Modified MIT (inherits from upstream; see LICENSE)
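
To make the weight format concrete, here is a minimal pure-Python sketch of NVFP4-style fake quantization (quantize, then dequantize). Values are snapped to the FP4 E2M1 grid, each 16-element block gets its own scale, and a single per-tensor scale keeps block scales inside the FP8 E4M3 range. This is an illustration under simplifying assumptions, not the modelopt or vLLM implementation: the function name is hypothetical, and the block scale is kept in full precision instead of being rounded to FP8 E4M3.

```python
# Illustrative NVFP4-style fake quantization (not the modelopt/vLLM kernels).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 E2M1
FP4_MAX = 6.0          # largest E2M1 magnitude
FP8_E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
BLOCK = 16             # elements that share one block scale


def fake_quant_nvfp4(values):
    """Quantize then dequantize a flat list of floats, block by block."""
    tensor_amax = max((abs(v) for v in values), default=0.0)
    if tensor_amax == 0.0:
        return [0.0] * len(values)
    # Per-tensor scale: chosen so the largest block scale equals FP8_E4M3_MAX.
    tensor_scale = tensor_amax / (FP4_MAX * FP8_E4M3_MAX)
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Block scale expressed in units of tensor_scale. Simplification:
        # a real NVFP4 encoder would round this number to FP8 E4M3.
        block_scale = block_amax / FP4_MAX / tensor_scale
        step = block_scale * tensor_scale  # value of one E2M1 unit in this block
        for v in block:
            if step == 0.0:
                out.append(0.0)
                continue
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / step - g))
            out.append(mag * step if v >= 0 else -mag * step)
    return out
```

Values that already sit on the scaled grid survive the round trip, while everything else moves to the nearest grid point of its block, which is why the block size and the calibrated scales matter for accuracy.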

Evaluation

Evaluated against the upstream moonshotai/Kimi-K2.6 (W4A16 compressed-tensors pack-quantized INT4, group=32) source. Both endpoints served via the NGC container nvcr.io/nvidia/vllm:26.03.post1-py3 (vLLM 0.17.1+bd67d66a.nvinternal), --enforce-eager, --moe-backend cutlass, FP8 KV cache, on 8× NVIDIA B300 SXM6.

Eval harness: lm-evaluation-harness 0.4.11 via the local-completions adapter, base-completion mode, num_concurrent=4, batch_size=1. Default lm-eval seed (0) and gen_kwargs.
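
As a rough guide to reproducing this setup, the sketch below assembles an lm-evaluation-harness command line from the settings stated above (local-completions adapter, num_concurrent=4, batch_size=1). Several details are assumptions and not taken from this card: the task names `gsm8k_cot` and `mmlu`, the endpoint URL on port 8001, and the served model name `/model`. Tokenizer-related arguments are omitted and may be needed in practice.

```python
import shlex


def lm_eval_command(task, num_fewshot,
                    base_url="http://localhost:8001/v1/completions",  # assumed endpoint
                    model_name="/model"):                              # assumed served name
    """Build an lm_eval CLI invocation for an OpenAI-compatible completions endpoint."""
    model_args = ",".join([
        f"model={model_name}",
        f"base_url={base_url}",
        "num_concurrent=4",
    ])
    argv = [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", model_args,
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "1",
    ]
    return shlex.join(argv)


print(lm_eval_command("gsm8k_cot", 8))  # task name is an assumption
print(lm_eval_command("mmlu", 0))       # task name is an assumption
```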

Benchmark | This (NVFP4) | INT4 source | Δ
GSM8K-CoT 8-shot, strict-match | 91.36% ± 0.77 | 91.51% ± 0.77 | −0.15 pp
GSM8K-CoT 8-shot, flexible-extract | 92.27% ± 0.74 | 91.21% ± 0.78 | +1.06 pp
MMLU 0-shot | 88.63% ± 0.26 | 89.03% ± 0.25 | −0.40 pp

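NVFP4 quantization costs little on these benchmarks: each delta is within about 1.1 standard errors of the difference (the two reported stderrs combined in quadrature), so none is statistically distinguishable from zero.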
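
One way to read the deltas in the table is to express each as a multiple of the standard error of the difference. The sketch below does that with the reported numbers. It treats the two runs as independent and combines their stderrs in quadrature, which is a simplifying assumption: both runs score the same test items, so a paired analysis would be more precise.

```python
import math

# (NVFP4 score, NVFP4 stderr, INT4 score, INT4 stderr), in percent, from the table.
ROWS = {
    "GSM8K-CoT strict-match": (91.36, 0.77, 91.51, 0.77),
    "GSM8K-CoT flexible-extract": (92.27, 0.74, 91.21, 0.78),
    "MMLU 0-shot": (88.63, 0.26, 89.03, 0.25),
}


def delta_in_sigma(nvfp4, nvfp4_se, int4, int4_se):
    """Return (delta in pp, delta divided by the combined standard error)."""
    delta = nvfp4 - int4
    combined_se = math.hypot(nvfp4_se, int4_se)  # sqrt(se1^2 + se2^2)
    return delta, delta / combined_se


for name, row in ROWS.items():
    delta, z = delta_in_sigma(*row)
    print(f"{name}: delta={delta:+.2f} pp, {z:+.2f} sigma")
```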

GSM8K-CoT under the alternative 3-shot, 3-seed-mean methodology used in some Kimi K2 evaluations was also run on the INT4 source for reference: 92.12% ± 0.45 flexible-extract, 87.22% ± 0.30 strict-match. Use the methodology that matches your reference. Only the INT4 figures for this setup are listed here; the relative NVFP4-vs-INT4 gap is reported to be consistent across both.

Hardware

Verified: 8× NVIDIA B300 SXM6 (Blackwell Ultra, sm_103a). vLLM serves cleanly with TP=4 and --max-model-len 8192 at --gpu-memory-utilization 0.85.
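
For a feel of why TP=4 at 0.85 utilization leaves room for the KV cache, here is a back-of-the-envelope budget. It uses the ~550 GiB checkpoint size and the serving flags from this card; the 288 GB per-GPU memory figure is an assumption about the B300 and not stated here. The estimate also assumes weights shard evenly across ranks and ignores activations and runtime overhead, so treat the result as an upper bound on headroom.

```python
def per_gpu_budget_gib(weights_gib, tp, gpu_mem_gib, utilization):
    """Return (weights per GPU, memory left inside vLLM's budget), both in GiB."""
    weights_per_gpu = weights_gib / tp        # assumes even sharding across ranks
    budget = gpu_mem_gib * utilization        # what --gpu-memory-utilization allows
    return weights_per_gpu, budget - weights_per_gpu


GPU_MEM_GIB = 288e9 / 2**30  # assumption: 288 GB of HBM per B300

weights, headroom = per_gpu_budget_gib(550, tp=4, gpu_mem_gib=GPU_MEM_GIB, utilization=0.85)
print(f"weights per GPU: {weights:.1f} GiB, headroom: {headroom:.1f} GiB")
```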

Should also work but not directly tested in this run: Blackwell B200 (sm_100). NVFP4 GEMM kernels in vLLM target sm_100+; if your build of vLLM has Blackwell-base kernels you can reasonably expect this checkpoint to load and run there. Hopper (H100/H200) and earlier GPUs do not have NVFP4 hardware support and will not work.
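
A quick pre-flight check along these lines can be sketched as follows. It parses compute-capability strings such as those printed by `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` and applies the sm_100+ threshold described above. The helper names are hypothetical, and passing this check does not guarantee that a given vLLM build ships the matching kernels.

```python
def parse_compute_caps(nvidia_smi_output):
    """Parse lines like '10.3' into (major, minor) tuples, skipping blank lines."""
    caps = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def all_nvfp4_capable(caps, min_major=10):
    """True only if at least one GPU is listed and every GPU is sm_100 or newer."""
    return bool(caps) and all(major >= min_major for major, _ in caps)


print(all_nvfp4_capable(parse_compute_caps("10.3\n10.3\n")))  # B300-style output
print(all_nvfp4_capable(parse_compute_caps("9.0\n")))         # Hopper-style output
```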

Serving

The recipe below uses NVIDIA's publicly-pullable NGC vLLM container — no NGC account required.

docker run --rm --gpus all --ipc=host \
    -v /path/to/model:/model \
    -p 8001:8001 \
    --entrypoint="" \
    nvcr.io/nvidia/vllm:26.03.post1-py3 \
    python -m vllm.entrypoints.openai.api_server \
        --model /model \
        --tensor-parallel-size 4 \
        --max-model-len 8192 \
        --gpu-memory-utilization 0.85 \
        --trust-remote-code \
        --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 \
        --enforce-eager \
        --moe-backend cutlass \
        --port 8001

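Tune --tensor-parallel-size, --max-model-len, and --gpu-memory-utilization for your hardware. A higher --max-model-len (up to 32768, the model's positional ceiling) is feasible if you have headroom. If vLLM reports that the KV cache is too small for the requested length, lower --max-model-len or raise --gpu-memory-utilization; lower --gpu-memory-utilization only if the GPUs run out of memory.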
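
Once the server is up, it can be queried through vLLM's OpenAI-compatible completions route. The sketch below uses only the Python standard library. It assumes the server is reachable on localhost:8001 as in the recipe above and that the served model name defaults to the `--model` path (`/model`); both are assumptions to adjust for your deployment, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8001",
                             model="/model", max_tokens=64):
    """Build a POST request for the /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("The capital of France is")` against a running server returns the generated continuation as a string.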

Quantization recipe (for reproducibility)

  • Source: moonshotai/Kimi-K2.6 (W4A16 compressed-tensors pack-quantized INT4, group=32) decompressed to a BF16 intermediate.
  • Tool: nvidia-modelopt==0.41.0 (examples/llm_ptq/hf_ptq.py) — same major version NVIDIA used for nvidia/Kimi-K2.5-NVFP4.
  • Mode: device_map="auto" + --use_seq_device_map (no --low_memory_mode); model loaded fully resident in BF16 across 8 GPUs at calibration time.
  • Calibration: cnn_dailymail, 512 samples, sequence length 512.
  • Algorithm: max (per mtq.NVFP4_DEFAULT_CFG).
  • KV cache: FP8 E4M3 (mtq.FP8_KV_CFG).
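
The "max" algorithm named in the recipe can be illustrated with a toy calibrator: it tracks the largest absolute value seen across calibration batches and derives a scale from it. This is a simplified sketch, not modelopt's implementation; the class and method names are hypothetical, and the scale formulas assume the usual conventions of mapping the observed maximum to the top of the FP4 E2M1 (6) and FP8 E4M3 (448) ranges.

```python
FP4_MAX = 6.0         # largest FP4 E2M1 magnitude
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude


class MaxCalibrator:
    """Toy max calibrator: remembers the running absolute maximum."""

    def __init__(self):
        self.amax = 0.0

    def collect(self, batch):
        """Update the running maximum with one batch of observed values."""
        for value in batch:
            self.amax = max(self.amax, abs(value))

    def nvfp4_tensor_scale(self):
        """Per-tensor FP32 scale such that block scales fit in FP8 E4M3."""
        return self.amax / (FP4_MAX * FP8_E4M3_MAX)

    def fp8_scale(self):
        """Per-tensor scale for a plain FP8 E4M3 tensor, e.g. a KV cache."""
        return self.amax / FP8_E4M3_MAX


calibrator = MaxCalibrator()
for batch in ([0.5, -3.0], [2.0, 1.25]):  # stand-in for real calibration activations
    calibrator.collect(batch)
print(calibrator.amax, calibrator.nvfp4_tensor_scale(), calibrator.fp8_scale())
```

Because the scale depends only on the largest value observed, a calibration set that misses the activation ranges of your workload yields scales that clip or waste precision there.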

Limitations

  • Calibration dataset is cnn_dailymail (general English news). For best quality on domain-specific or multilingual workloads, recalibrate on representative data.
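  • Quantization noise is concentrated in the language-model MoE experts. The vision tower and projector stay in BF16, but no multimodal benchmark is reported above, so treat multimodal quality as unverified.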
  • Evaluations above are base-completion lm-eval, not chat-template. Numbers may differ when the model is invoked through its chat template; for production, evaluate in your serving format.
  • --enforce-eager is set in the verified serving recipe to avoid CUDA graph compilation issues we observed under concurrent load with this build of vLLM. Performance is therefore not optimal; users with a more recent vLLM may be able to drop this flag.

Ethical considerations

This quantization preserves the linguistic and behavioral characteristics of the upstream moonshotai/Kimi-K2.6 checkpoint. Any biases, factual errors, or unsafe behaviors present in the upstream model are preserved here — quantization neither introduces nor mitigates them. Evaluate the model in your deployment context before serving to end users, and apply use-case-appropriate safety filtering on top.

Attribution and License

Weights are derived from moonshotai/Kimi-K2.6, licensed under the Modified MIT (Kimi) License. The LICENSE file is preserved in this repository verbatim from the upstream release; all upstream attributions and notices apply.

For upstream model details, training data, capabilities, and intended use, see the original model card at moonshotai/Kimi-K2.6.

Quantization tooling: NVIDIA TensorRT Model Optimizer. Serving container: NVIDIA NGC vLLM.

Citation

If this quantization is useful in your work, please cite upstream Kimi-K2.6 and, optionally, this quant:

@misc{moonshot_kimi_k26,
  title = {Kimi K2.6},
  author = {Moonshot AI},
  howpublished = {\url{https://huggingface.co/moonshotai/Kimi-K2.6}},
  year = {2026}
}
@misc{waferai_kimi_k26_nvfp4,
  title = {Kimi-K2.6-NVFP4},
  author = {Wafer AI},
  howpublished = {\url{https://huggingface.co/wafer-ai/Kimi-K2.6-NVFP4}},
  year = {2026}
}