Instructions to use wafer-ai/Kimi-K2.6-NVFP4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use wafer-ai/Kimi-K2.6-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="wafer-ai/Kimi-K2.6-NVFP4", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("wafer-ai/Kimi-K2.6-NVFP4", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use wafer-ai/Kimi-K2.6-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "wafer-ai/Kimi-K2.6-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wafer-ai/Kimi-K2.6-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/wafer-ai/Kimi-K2.6-NVFP4

SGLang

How to use wafer-ai/Kimi-K2.6-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "wafer-ai/Kimi-K2.6-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wafer-ai/Kimi-K2.6-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "wafer-ai/Kimi-K2.6-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wafer-ai/Kimi-K2.6-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use wafer-ai/Kimi-K2.6-NVFP4 with Docker Model Runner:
```
docker model run hf.co/wafer-ai/Kimi-K2.6-NVFP4
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Kimi-K2.6-NVFP4

NVFP4 post-training quantization of moonshotai/Kimi-K2.6 for serving on NVIDIA Blackwell hardware with vLLM.

Third-party model. Weights are derived from moonshotai/Kimi-K2.6 (Moonshot AI). Wafer AI is not affiliated with Moonshot AI; this checkpoint is a community quantization. Use of this model is governed by the upstream Modified MIT License.

At a glance


Architecture	MoE (DeepSeek-V3 style) with Kimi vision tower; ~1T total parameters
Weight format	NVFP4 (FP4 E2M1, 16-element block scale in FP8 E4M3, FP32 per-tensor scale — all calibrated)
Activation format	NVFP4 (calibrated FP32 per-tensor scale; FP8 E4M3 per-16-element block scale computed at inference)
KV cache	FP8 E4M3
Excluded from quant (kept BF16)	`lm_head`, all `self_attn` projections (MLA), `vision_tower` (covers MoonViT), `mm_projector` — set in `quantization_config.ignore`. MoE routers (`mlp.gate.weight`) are also BF16: `modelopt` leaves them as scoring layers despite not being in the explicit ignore list.
Storage	~590 GB (550 GiB) across 119 safetensors shards
License	Modified MIT (inherits from upstream — see `LICENSE`)
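
To make the weight format above more concrete, here is a minimal pure-Python sketch of quantizing one 16-element block onto the FP4 E2M1 grid with a single block scale. It is an illustration only, not the modelopt implementation: the function names are hypothetical, the block scale is kept as a Python float instead of being rounded to FP8 E4M3, and the FP32 per-tensor scale is left out.

```python
# Illustrative NVFP4-style quantization of one 16-element block (not modelopt code).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 E2M1
BLOCK_SIZE = 16


def quantize_block(values):
    """Return (block_scale, codes), where codes are signed E2M1 grid values."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest FP4 value (6.0).
    # Real NVFP4 also rounds this scale to FP8 E4M3; that step is omitted here.
    block_scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        scaled = abs(v) / block_scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - scaled))
        codes.append(nearest if v >= 0 else -nearest)
    return block_scale, codes


def dequantize_block(block_scale, codes):
    """Reconstruct approximate values from grid codes and the block scale."""
    return [c * block_scale for c in codes]


# Toy block of evenly spaced values between -0.15 and 0.15.
block = [0.02 * i - 0.15 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"block scale={scale:.5f}, max abs error={max_err:.5f}")
```

Because the grid is coarsest between 4 and 6, the worst-case error in this sketch is one block scale, which is why small blocks with their own scale matter for accuracy.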

Evaluation

Evaluated against the upstream moonshotai/Kimi-K2.6 (W4A16 compressed-tensors pack-quantized INT4, group=32) source. Both endpoints served via the NGC container nvcr.io/nvidia/vllm:26.03.post1-py3 (vLLM 0.17.1+bd67d66a.nvinternal), --enforce-eager, --moe-backend cutlass, FP8 KV cache, on 8× NVIDIA B300 SXM6.

Eval harness: lm-evaluation-harness 0.4.11 via the local-completions adapter, base-completion mode, num_concurrent=4, batch_size=1. Default lm-eval seed (0) and gen_kwargs.

Benchmark	This (NVFP4)	INT4 source	Δ
GSM8K-CoT 8-shot, strict-match	91.36% ± 0.77	91.51% ± 0.77	−0.15 pp
GSM8K-CoT 8-shot, flexible-extract	92.27% ± 0.74	91.21% ± 0.78	+1.06 pp
MMLU 0-shot	88.63% ± 0.26	89.03% ± 0.25	−0.40 pp

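NVFP4 quantization is essentially lossless on these benchmarks: treating the two runs as independent, every delta is within about 1.1σ of the combined standard error (the largest is MMLU, at 0.40 pp against a combined stderr of about 0.36 pp).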
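
To see how the deltas in the table compare with the reported standard errors, the short script below recomputes each delta and expresses it in units of the combined standard error. It assumes the NVFP4 and INT4 runs are independent, which is a simplification since both use the same test items; the helper name `delta_in_sigmas` is hypothetical.

```python
import math

# (benchmark, NVFP4 score, NVFP4 stderr, INT4 score, INT4 stderr), all in percent
RESULTS = [
    ("GSM8K-CoT strict-match", 91.36, 0.77, 91.51, 0.77),
    ("GSM8K-CoT flexible-extract", 92.27, 0.74, 91.21, 0.78),
    ("MMLU 0-shot", 88.63, 0.26, 89.03, 0.25),
]


def delta_in_sigmas(score_a, se_a, score_b, se_b):
    """Return (delta, combined stderr, |delta| / combined stderr).

    Assumes the two runs are independent, so their variances add.
    """
    delta = score_a - score_b
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return delta, combined, abs(delta) / combined


for name, a, se_a, b, se_b in RESULTS:
    delta, combined, z = delta_in_sigmas(a, se_a, b, se_b)
    print(f"{name}: delta={delta:+.2f} pp, combined stderr={combined:.2f} pp, {z:.2f} sigma")
```

The same function can be reused when you rerun the benchmarks in your own serving format and want to compare against these numbers.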

GSM8K-CoT under the alternative 3-shot, 3-seed-mean methodology used in some Kimi K2 evaluations was also run on the INT4 source for reference: 92.12% ± 0.45 flexible-extract, 87.22% ± 0.30 strict-match. Use the methodology that matches your reference; the relative NVFP4-vs-INT4 gap is consistent across both.

Hardware

Verified: 8× NVIDIA B300 SXM6 (Blackwell Ultra, sm_103a). vLLM serves cleanly with TP=4 and --max-model-len 8192 at --gpu-memory-utilization 0.85.

Should also work but not directly tested in this run: Blackwell B200 (sm_100). NVFP4 GEMM kernels in vLLM target sm_100+; if your build of vLLM has Blackwell-base kernels you can reasonably expect this checkpoint to load and run there. Hopper (H100/H200) and earlier GPUs do not have NVFP4 hardware support and will not work.
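
As a rough way to reason about whether a given GPU count is plausible, the sketch below divides the ~590 GB checkpoint evenly across tensor-parallel ranks and compares it with the memory vLLM may claim at a given `--gpu-memory-utilization`. Everything beyond the 590 GB figure is an assumption: the per-GPU HBM capacities are nominal values you should check against your SKU, the even split ignores replicated layers, and activations, CUDA context, and KV cache must all fit in the remaining headroom.

```python
CHECKPOINT_GB = 590.0  # storage size quoted in this card


def per_gpu_budget(hbm_gb, tensor_parallel, gpu_memory_utilization, checkpoint_gb=CHECKPOINT_GB):
    """Estimate the per-GPU weight share and the headroom left in vLLM's budget."""
    weights_gb = checkpoint_gb / tensor_parallel   # assumes an even split across ranks
    budget_gb = hbm_gb * gpu_memory_utilization    # fraction of HBM vLLM may claim
    return {
        "weights_gb": weights_gb,
        "budget_gb": budget_gb,
        "headroom_gb": budget_gb - weights_gb,
    }


# HBM sizes below are assumed nominal capacities, not measured values.
SCENARIOS = [
    ("B300, TP=4", 288.0, 4),
    ("B200, TP=4", 192.0, 4),
    ("B200, TP=8", 192.0, 8),
]

for label, hbm_gb, tp in SCENARIOS:
    est = per_gpu_budget(hbm_gb, tp, 0.85)
    print(
        f"{label}: weights ~{est['weights_gb']:.1f} GB/GPU, "
        f"budget ~{est['budget_gb']:.1f} GB, headroom ~{est['headroom_gb']:.1f} GB"
    )
```

A small or negative headroom suggests raising the tensor-parallel size before trying to load the checkpoint.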

Serving

The recipe below uses NVIDIA's publicly-pullable NGC vLLM container — no NGC account required.

docker run --rm --gpus all --ipc=host \
    -v /path/to/model:/model \
    -p 8001:8001 \
    --entrypoint="" \
    nvcr.io/nvidia/vllm:26.03.post1-py3 \
    python -m vllm.entrypoints.openai.api_server \
        --model /model \
        --tensor-parallel-size 4 \
        --max-model-len 8192 \
        --gpu-memory-utilization 0.85 \
        --trust-remote-code \
        --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 \
        --enforce-eager \
        --moe-backend cutlass \
        --port 8001

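Tune --tensor-parallel-size, --max-model-len, and --gpu-memory-utilization for your hardware. A higher --max-model-len (up to 32768, the model's positional ceiling) is feasible if you have headroom. If vLLM reports that the KV cache is too small for the requested length, lower --max-model-len or raise --gpu-memory-utilization; lower --gpu-memory-utilization only if the GPU runs out of memory outside vLLM's reserved pool.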
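
Once the container is up, the server can be queried over its OpenAI-compatible API. The following is a minimal standard-library client sketch for the recipe above. It assumes the server listens on localhost port 8001 and that no `--served-model-name` was passed, in which case the model name is expected to be the `--model` path (`/model`); the function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001/v1"  # matches --port 8001 in the serving recipe
MODEL_NAME = "/model"                  # assumption: served name defaults to the --model path


def build_chat_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=128, timeout=120):
    """Send the prompt and return the assistant message text (may be None)."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call, to be run only while the server is up:
# print(chat("Summarize NVFP4 in one sentence."))
```

With a reasoning parser enabled, part of the output may be returned in a separate reasoning field, so a small `max_tokens` can leave the `content` field empty.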

Quantization recipe (for reproducibility)

Source: moonshotai/Kimi-K2.6 (W4A16 compressed-tensors pack-quantized INT4, group=32) decompressed to a BF16 intermediate.
Tool: nvidia-modelopt==0.41.0 (examples/llm_ptq/hf_ptq.py) — same major version NVIDIA used for nvidia/Kimi-K2.5-NVFP4.
Mode: device_map="auto" + --use_seq_device_map (no --low_memory_mode); model loaded fully resident in BF16 across 8 GPUs at calibration time.
Calibration: cnn_dailymail, 512 samples, sequence length 512.
Algorithm: max (per mtq.NVFP4_DEFAULT_CFG).
KV cache: FP8 E4M3 (mtq.FP8_KV_CFG).
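
The "max" algorithm in the recipe above can be pictured as tracking the largest absolute value seen during calibration and deriving a scale from it. The sketch below is a toy illustration, not modelopt code: the class name is hypothetical, the batches are made-up numbers, and dividing by 6 × 448 (the E2M1 and E4M3 maxima) is an assumed convention for the FP32 per-tensor scale that follows from the format ranges.

```python
E2M1_MAX = 6.0    # largest magnitude in FP4 E2M1
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


class MaxCalibrator:
    """Track the running absolute maximum across calibration batches."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, batch):
        """Update the running maximum with one batch of values."""
        batch_amax = max((abs(x) for x in batch), default=0.0)
        self.amax = max(self.amax, batch_amax)

    def per_tensor_scale(self):
        # Assumed convention: the largest observed value stays representable as
        # (max FP8 block scale) * (max FP4 value) * per_tensor_scale.
        return self.amax / (E2M1_MAX * E4M3_MAX)


calibrator = MaxCalibrator()
toy_batches = ([0.1, -2.5, 0.7], [1.9, -0.3], [3.2, -4.1, 0.0])  # stand-ins for activations
for batch in toy_batches:
    calibrator.observe(batch)
print(f"amax={calibrator.amax}, per-tensor scale={calibrator.per_tensor_scale():.6f}")
```

Because max calibration is driven by the single largest value observed, outliers in the calibration set directly widen the scale, which is one reason to calibrate on data that resembles your workload.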

Limitations

Calibration dataset is cnn_dailymail (general English news). For best quality on domain-specific or multilingual workloads, recalibrate on representative data.
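Quantization noise is concentrated in the language-model MoE experts. The vision tower and projector remain in BF16, so the vision encoder itself is unaffected; end-to-end multimodal quality was not benchmarked in the evaluation above.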
Evaluations above are base-completion lm-eval, not chat-template. Numbers may differ when the model is invoked through its chat template; for production, evaluate in your serving format.
--enforce-eager is set in the verified serving recipe to avoid CUDA graph compilation issues we observed under concurrent load with this build of vLLM. Performance is therefore not optimal; users with a more recent vLLM may be able to drop this flag.

Ethical considerations

This quantization preserves the linguistic and behavioral characteristics of the upstream moonshotai/Kimi-K2.6 checkpoint. Any biases, factual errors, or unsafe behaviors present in the upstream model are preserved here — quantization neither introduces nor mitigates them. Evaluate the model in your deployment context before serving to end users, and apply use-case-appropriate safety filtering on top.

Attribution and License

Weights are derived from moonshotai/Kimi-K2.6, licensed under the Modified MIT (Kimi) License. The LICENSE file is preserved in this repository verbatim from the upstream release; all upstream attributions and notices apply.

For upstream model details, training data, capabilities, and intended use, see the original model card at moonshotai/Kimi-K2.6.

Quantization tooling: NVIDIA TensorRT Model Optimizer. Serving container: NVIDIA NGC vLLM.

Citation

If this quantization is useful in your work, please cite upstream Kimi-K2.6 and, optionally, this quant:

@misc{moonshot_kimi_k26,
  title = {Kimi K2.6},
  author = {Moonshot AI},
  howpublished = {\url{https://huggingface.co/moonshotai/Kimi-K2.6}},
  year = {2026}
}
@misc{waferai_kimi_k26_nvfp4,
  title = {Kimi-K2.6-NVFP4},
  author = {Wafer AI},
  howpublished = {\url{https://huggingface.co/wafer-ai/Kimi-K2.6-NVFP4}},
  year = {2026}
}
