Multi-Token Prediction (MTP) Layers Grafted on INT8 Quantization
This repo is the TheHouseOfTheDude/Qwen3.6-27B-INT8 W8A16 PTQ quantization with unquantized BF16 MTP layers grafted on top.
- Base quantization: TheHouseOfTheDude Qwen3.6-27B-INT8 — INT8 weights, FP16/BF16 activations via
compressed-tensors(pack-quantized format). See the original repo for quantization details and KLD results.- MTP layers: Taken from Qwen/Qwen3.6-27B in full BF16 precision and added as a second safetensors file (
model-00002-of-00002.safetensors).- Chat template (v9): Updated from froggeric/Qwen-Fixed-Chat-Templates. Fixes broken
</think>tags, misplaced tool calls inside think blocks, and other structural issues that break agentic coding and OpenClaw usage.- Why BF16 for MTP? MTP layers were excluded from the original quantization recipe (
ignore: ['re:.*mtp.*']). Keeping them unquantized avoids any loss of draft-token quality while adding only ~850 MB of overhead.config.jsonincludesre:.*mtp.*in thequantization_config.ignorelist so vLLM skips quantization wrapping for MTP Linear layers during loading.To serve with MTP enabled, use:
vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \ --tensor-parallel-size 2 \ --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Qwen3.6-27B-INT8-MTP
Post-Training Quantized (PTQ) version of Qwen/Qwen3.6-27B with Multi-Token Prediction support for vLLM.
- Base quantization: W8A16 (INT8 weights, FP16/BF16 activations) by TheHouseOfTheDude
- MTP layers: 1 MTP layer in BF16, grafted from Qwen/Qwen3.6-27B
- Chat template: v9 from froggeric/Qwen-Fixed-Chat-Templates — fixes agentic coding and OpenClaw compatibility
- Format:
compressed-tensors/pack-quantized(2-file sharded safetensors) - Mean KLD: 0.009441
W8A16 (INT8 weights, FP16/BF16 activations) runs efficiently on Ampere (A100/RTX 3090) and newer GPUs. The INT8 weights halve the weight memory compared to BF16, while activations stay in FP16/BF16 for full precision. Pair with
--kv-cache-dtype fp8_e4m3to cut KV cache memory further on supported hardware.
Files
| File | Size | Contents |
|---|---|---|
model-00001-of-00002.safetensors |
~33 GB | INT8 quantized base model weights |
model-00002-of-00002.safetensors |
~850 MB | BF16 MTP layer weights |
model.safetensors.index.json |
— | Weight index mapping |
Usage
Adjust
--tensor-parallel-sizeandCUDA_VISIBLE_DEVICESto match your GPU setup. This model requires at least 2 × 24 GB GPUs (e.g. dual RTX 3090/4090) with TP=2. The INT8 weights are ~33 GB plus ~850 MB for MTP layers, plus KV cache and activation overhead. For larger setups, scale up to TP=4 or TP=8 accordingly.
Minimum Setup (2 × 24 GB GPUs)
CUDA_VISIBLE_DEVICES=0,1 vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 262144 \
--language-model-only \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
--attention-backend FLASHINFER \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_xml
Full Reference Command (all features)
CUDA_VISIBLE_DEVICES=0,1 vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--max-model-len 200000 \
--max-num-seqs 2 \
--kv-cache-dtype fp8_e4m3 \
--max-num-batched-tokens 4096 \
--language-model-only \
--performance-mode interactivity \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--skip-mm-profiling \
--attention-backend FLASHINFER \
--async-scheduling \
--enable-chunked-prefill \
--enable-prefix-caching \
--mamba-cache-mode all \
--mamba-block-size 8 \
--reasoning-parser qwen3 \
--kv-cache-metrics \
--enable-prompt-tokens-details \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
- Downloads last month
- 5,099
Model tree for havenoammo/Qwen3.6-27B-INT8-MTP
Base model
Qwen/Qwen3.6-27B