Multi-Token Prediction (MTP) Layers Grafted on INT8 Quantization

This repo is the TheHouseOfTheDude/Qwen3.6-27B-INT8 W8A16 PTQ quantization with unquantized BF16 MTP layers grafted on top.

  • Base quantization: TheHouseOfTheDude Qwen3.6-27B-INT8 — INT8 weights, FP16/BF16 activations via compressed-tensors (pack-quantized format). See the original repo for quantization details and KLD results.
  • MTP layers: Taken from Qwen/Qwen3.6-27B in full BF16 precision and added as a second safetensors file (model-00002-of-00002.safetensors).
  • Chat template (v9): Updated from froggeric/Qwen-Fixed-Chat-Templates. Fixes broken </think> tags, misplaced tool calls inside think blocks, and other structural issues that break agentic coding and OpenClaw usage.
  • Why BF16 for MTP? MTP layers were excluded from the original quantization recipe (ignore: ['re:.*mtp.*']). Keeping them unquantized avoids any loss of draft-token quality while adding only ~850 MB of overhead.
  • config.json includes re:.*mtp.* in the quantization_config.ignore list so vLLM skips quantization wrapping for MTP Linear layers during loading.

To serve with MTP enabled, use:

vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
  --tensor-parallel-size 2 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Qwen3.6-27B-INT8-MTP

Post-Training Quantized (PTQ) version of Qwen/Qwen3.6-27B with Multi-Token Prediction support for vLLM.

  • Base quantization: W8A16 (INT8 weights, FP16/BF16 activations) by TheHouseOfTheDude
  • MTP layers: 1 MTP layer in BF16, grafted from Qwen/Qwen3.6-27B
  • Chat template: v9 from froggeric/Qwen-Fixed-Chat-Templates — fixes agentic coding and OpenClaw compatibility
  • Format: compressed-tensors / pack-quantized (2-file sharded safetensors)
  • Mean KLD: 0.009441

W8A16 (INT8 weights, FP16/BF16 activations) runs efficiently on Ampere (A100/RTX 3090) and newer GPUs. The INT8 weights halve the weight memory compared to BF16, while activations stay in FP16/BF16 for full precision. Pair with --kv-cache-dtype fp8_e4m3 to cut KV cache memory further on supported hardware.

Files

File Size Contents
model-00001-of-00002.safetensors ~33 GB INT8 quantized base model weights
model-00002-of-00002.safetensors ~850 MB BF16 MTP layer weights
model.safetensors.index.json Weight index mapping

Usage

Adjust --tensor-parallel-size and CUDA_VISIBLE_DEVICES to match your GPU setup. This model requires at least 2 × 24 GB GPUs (e.g. dual RTX 3090/4090) with TP=2. The INT8 weights are ~33 GB plus ~850 MB for MTP layers, plus KV cache and activation overhead. For larger setups, scale up to TP=4 or TP=8 accordingly.

Minimum Setup (2 × 24 GB GPUs)

CUDA_VISIBLE_DEVICES=0,1 vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 262144 \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
  --attention-backend FLASHINFER \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_xml

Full Reference Command (all features)

CUDA_VISIBLE_DEVICES=0,1 vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --max-model-len 200000 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-batched-tokens 4096 \
  --language-model-only \
  --performance-mode interactivity \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --skip-mm-profiling \
  --attention-backend FLASHINFER \
  --async-scheduling \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --reasoning-parser qwen3 \
  --kv-cache-metrics \
  --enable-prompt-tokens-details \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
Downloads last month
5,099
Safetensors
Model size
28B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for havenoammo/Qwen3.6-27B-INT8-MTP

Base model

Qwen/Qwen3.6-27B
Quantized
(408)
this model