Multi-Token Prediction (MTP) Layers Grafted on INT8 Quantization

This repo is the TheHouseOfTheDude/Qwen3.6-27B-INT8 W8A16 PTQ quantization with unquantized BF16 MTP layers grafted on top.

Base quantization: TheHouseOfTheDude Qwen3.6-27B-INT8 — INT8 weights, FP16/BF16 activations via compressed-tensors (pack-quantized format). See the original repo for quantization details and KLD results.

MTP layers: Taken from Qwen/Qwen3.6-27B in full BF16 precision and added as a second safetensors file (model-00002-of-00002.safetensors).

Chat template (v9): Updated from froggeric/Qwen-Fixed-Chat-Templates. Fixes broken </think> tags, misplaced tool calls inside think blocks, and other structural issues that break agentic coding and OpenClaw usage.

Why BF16 for MTP? MTP layers were excluded from the original quantization recipe (ignore: ['re:.*mtp.*']). Keeping them unquantized avoids any loss of draft-token quality while adding only ~850 MB of overhead.

config.json includes re:.*mtp.* in the quantization_config.ignore list so vLLM skips quantization wrapping for MTP Linear layers during loading.

To serve with MTP enabled, use:
vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
  --tensor-parallel-size 2 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Qwen3.6-27B-INT8-MTP

Post-Training Quantized (PTQ) version of Qwen/Qwen3.6-27B with Multi-Token Prediction support for vLLM.

Base quantization: W8A16 (INT8 weights, FP16/BF16 activations) by TheHouseOfTheDude
MTP layers: 1 MTP layer in BF16, grafted from Qwen/Qwen3.6-27B
Chat template: v9 from froggeric/Qwen-Fixed-Chat-Templates — fixes agentic coding and OpenClaw compatibility
Format: compressed-tensors / pack-quantized (2-file sharded safetensors)
Mean KLD: 0.009441

W8A16 (INT8 weights, FP16/BF16 activations) runs efficiently on Ampere (A100/RTX 3090) and newer GPUs. The INT8 weights halve the weight memory compared to BF16, while activations stay in FP16/BF16 for full precision. Pair with --kv-cache-dtype fp8_e4m3 to cut KV cache memory further on supported hardware.

Files

File	Size	Contents
`model-00001-of-00002.safetensors`	~33 GB	INT8 quantized base model weights
`model-00002-of-00002.safetensors`	~850 MB	BF16 MTP layer weights
`model.safetensors.index.json`	—	Weight index mapping

Usage

Adjust --tensor-parallel-size and CUDA_VISIBLE_DEVICES to match your GPU setup. This model requires at least 2 × 24 GB GPUs (e.g. dual RTX 3090/4090) with TP=2. The INT8 weights are ~33 GB plus ~850 MB for MTP layers, plus KV cache and activation overhead. For larger setups, scale up to TP=4 or TP=8 accordingly.

Minimum Setup (2 × 24 GB GPUs)

CUDA_VISIBLE_DEVICES=0,1 vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 262144 \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
  --attention-backend FLASHINFER \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_xml

Full Reference Command (all features)

CUDA_VISIBLE_DEVICES=0,1 vllm serve havenoammo/Qwen3.6-27B-INT8-MTP \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --max-model-len 200000 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-batched-tokens 4096 \
  --language-model-only \
  --performance-mode interactivity \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --skip-mm-profiling \
  --attention-backend FLASHINFER \
  --async-scheduling \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --reasoning-parser qwen3 \
  --kv-cache-metrics \
  --enable-prompt-tokens-details \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'

Downloads last month: 5,099

Safetensors

Model size

28B params

Tensor type

I64

I32

BF16

Model tree for havenoammo/Qwen3.6-27B-INT8-MTP

Base model

Qwen/Qwen3.6-27B

Quantized

(408)

this model