Qwen3.6-27B — MLX 4-bit with native Multi-Token Prediction (MTP) weights

A 4-bit MLX conversion of Qwen/Qwen3.6-27B that preserves the mtp.* head weights, so it can be served with native in-graph speculative decoding via the open pull request ml-explore/mlx-lm#990 ("Native MTP speculative decoding, Qwen3.5/3.6 reference implementation" by @AirRunner).

Why this checkpoint exists

Qwen3.5/3.6 ship with a built-in MTP head (mtp_num_hidden_layers: 1 in the config) that predicts token t+2 from the backbone hidden state at t and the embedding of token t+1. PR #990 turns those weights into a working spec-decode loop — ~1.5× single-stream peak throughput, ≥80 % draft acceptance, identical greedy output vs. plain decode.

The popular mlx-community/Qwen3.6-27B-4bit checkpoint was converted with an older sanitize() that strips mtp.* weights. This repo is a fresh BF16 → 4-bit conversion done with the PR-branch sanitize(), keeping all 31 mtp.* tensors intact (mtp.fc.{weight,scales,biases} + the 1-layer transformer block).

total tensors in this checkpoint : 1878
  of which mtp.*                 :   31   (preserved)
bits per weight (effective)      : 4.501

Quick start

You need the PR-branch build of mlx-lm (until the PR lands in a release):

git clone -b feat/mtp-native https://github.com/AirRunner/mlx-lm.git
pip install -e ./mlx-lm

Then serve or generate with --mtp:

# OpenAI-compatible HTTP server
mlx_lm.server --model kradih/Qwen3.6-27B-MTP-4bit-MLX --host 127.0.0.1 --port 8081 --mtp

# One-shot generate
mlx_lm.generate --model kradih/Qwen3.6-27B-MTP-4bit-MLX --mtp \
  --prompt "Why is single-stream decode bandwidth-bound? Think step by step."

Reasoning ("thinking") is enabled by default — the Qwen3.6 chat template emits <think>...</think> content; in the streamed OpenAI response it arrives in delta.reasoning.

The checkpoint also loads fine with stock mlx-lm releases (without --mtp); MTP weights are simply ignored and you get the same throughput as plain 4-bit MLX serving.

Performance — measured

Tested on a MacBook Pro M4 Max (16-core CPU 12P+4E / 40-core GPU / 546 GB/s unified memory / 128 GB unified), macOS 26.3.1, AC plugged in, external 5K display attached, mlx-lm @ PR #990 commit f840f6c. Methodology: single stream, concurrency = 1, fixed seed, 1024-token input → 512-token output, temperature 0.7, thinking enabled, 30 trials + 3 warmup, ignore_eos so generation length is bounded.

Stack	out tok/s p50	TPOT (pure decode)	TTFT p50	First-3-trial peak	Drift across 30 trials
`mlx-community/Qwen3.6-27B-4bit` (plain decode)	20.99	41.8 ms (= 23.9 tok/s)	3058 ms	22.0	+3.4 % ✅
This repo + `mlx-lm --mtp` (PR #990)	16.4 – 19.1	45 – 53 ms steady-state	3700 – 4100 ms	31 – 34	-8 to +37 % depending on `--prompt-cache-size`
llama.cpp Q4_K_M (Metal, q8_0 KV)	14.45	62.9 ms	3260 ms	15.9	+6.6 % ✅
Ollama `qwen3.6:27b` (Q4_K_M + q8_0 KV + FA)	13.83	65.9 ms	3212 ms	14.7	+8.9 % ✅

MTP shows a 1.5× cold-start advantage (31–34 tok/s for the first 3–5 requests, matching the PR author's own M4 Pro numbers when scaled by bandwidth) but degrades to baseline-or-below by trial 10, even with --prompt-cache-size 0. This is consistent across repeated runs and is not the LRU prompt cache (the cache stays at 0 sequences). Suspected cause is MLX Metal-allocator state pollution across requests; PR #990's commit ffac4333 ("clear Metal allocator cache every 256 tokens during decode") addresses the within-request case but not the inter-request one. Issue surfaced on M4 Max 40-core GPU — has not been triaged upstream as of 2026-05-19.

Bottom line: for genuine production single-stream serving, plain MLX 4-bit currently beats MTP on this hardware. For burst/interactive workloads where the first few responses dominate user-perceived latency, MTP shines.

Author's reference numbers from PR #990 (same model, on a smaller M4 Pro with ~273 GB/s bandwidth): baseline 15.7 → MTP 24.6 tok/s, 88 % draft acceptance at temp = 0, 80 % at temp = 1.0.

Quantization details

Field	Value
Source	`Qwen/Qwen3.6-27B` BF16 (≈55.6 GB on disk, 15 shards)
Weight bits	4
Group size	64
Quantization mode	`affine`
Effective bpw	4.501
MTP head	preserved (31 `language_model.mtp.*` tensors retained)
Format	MLX safetensors (3 × ~5 GB shards)
On-disk size	~14 GB
Conversion command	`mlx_lm convert --hf-path Qwen/Qwen3.6-27B --mlx-path out --quantize --q-bits 4 --q-group-size 64` (with PR #990's branch installed)
Convert wall time on M4 Max	12.9 s

Context window

Native context window of the base model: 262,144 tokens (declared in config.json). Memory required for the full window at fp16 KV cache is ~67 GB; on 128 GB systems that leaves >40 GB headroom above the 20 GB OS reserve. Tested up to ~9k-token prompts on M4 Max with no measured throttling.

Memory footprint (measured)

State	Resident memory
Server idle (weights loaded)	~14.5 GB
~1k-token prompt active	~14.5–15 GB
Full 262144-token KV cache (fp16, projected)	~82 GB total (weights + KV)

Limitations / caveats

PR #990 is not yet merged. Until then, you need AirRunner's branch installed — stock mlx-lm releases will load this checkpoint but will not use the MTP head (you still get a clean 4-bit checkpoint with the same throughput as mlx-community/Qwen3.6-27B-4bit).
Cross-request degradation: see "Performance — measured" above. Open issue, not specific to this checkpoint.
M1 / M2 Macs: if you do not quantise mtp.fc, pass --dtype float16 to mlx_lm convert; otherwise BF16 ops on M1/M2 hit a slow CPU fallback. M3/M4 are unaffected — they have native BF16 GPU support.
Multimodal: the base model is image-text-to-text. This conversion preserves the vision tower; if you only need text serving, the vision weights add disk but no inference cost (loaded but unused).
Identity: under greedy decoding the MTP output is bit-exact equal to plain generate_step (PR #990's test_mtp.py). Under sampling, accepted draft tokens come from the same shared distribution as the verify pass.
Single-stream only. Numbers above are concurrency = 1. Multi-stream behaviour with MTP has not been benchmarked here.

Credits

Qwen team — original model weights, training, and the open Apache-2.0 release.
Luca Vaio / @AirRunner — the MTP integration in ml-explore/mlx-lm#990, the reference implementation, and the unit tests this conversion depends on.
mlx-community — the standing convention for MLX-format checkpoints (mlx_lm.convert defaults, group_size 64, affine quant).
Apple's MLX team — the runtime.

License

Apache-2.0, inherited from the base model.

Downloads last month: 1,774

Safetensors

Model size

27B params

Tensor type

BF16

U32

MLX

Hardware compatibility

4-bit

Model tree for kradih/Qwen3.6-27B-MTP-4bit-MLX

Base model

Qwen/Qwen3.6-27B

Quantized

(414)

this model