Qwen3.6-27B — MLX 4-bit with native Multi-Token Prediction (MTP) weights

A 4-bit MLX conversion of Qwen/Qwen3.6-27B that preserves the mtp.* head weights, so it can be served with native in-graph speculative decoding via the open pull request ml-explore/mlx-lm#990 ("Native MTP speculative decoding, Qwen3.5/3.6 reference implementation" by @AirRunner).

Why this checkpoint exists

Qwen3.5/3.6 ship with a built-in MTP head (mtp_num_hidden_layers: 1 in the config) that predicts token t+2 from the backbone hidden state at t and the embedding of token t+1. PR #990 turns those weights into a working spec-decode loop — ~1.5× single-stream peak throughput, ≥80 % draft acceptance, identical greedy output vs. plain decode.
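The draft-and-verify idea behind that loop can be illustrated with a toy sketch. This is not PR #990's code: `target_next` and `draft_next` are hypothetical stand-ins for the backbone and the MTP head, and the draft is made wrong on purpose at every fifth position. It only shows why greedy output stays identical to plain decode while the number of backbone passes drops.

```python
def target_next(tokens):
    """Toy stand-in for the backbone: a deterministic greedy next token."""
    return (31 * sum(tokens) + 7 * len(tokens)) % 97


def draft_next(tokens):
    """Toy stand-in for the MTP head: usually right, deliberately wrong
    whenever the context length is a multiple of 5."""
    guess = target_next(tokens)
    return (guess + 1) % 97 if len(tokens) % 5 == 0 else guess


def plain_decode(prompt, n):
    """One backbone pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def mtp_decode(prompt, n):
    """Draft-and-verify decode. Returns (tokens, backbone_passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        draft = draft_next(out)
        # One backbone pass scores the confirmed context and the draft together.
        passes += 1
        verified = target_next(out)
        out.append(verified)  # always the backbone's own choice
        if draft == verified:
            # Draft accepted: the same pass also yields the following token.
            out.append(target_next(out))
    return out[len(prompt):len(prompt) + n], passes


prompt = [1, 2, 3]
tokens, passes = mtp_decode(prompt, 20)
assert tokens == plain_decode(prompt, 20)  # same greedy output
print(passes, "backbone passes for 20 tokens (plain decode needs 20)")
```

Because every emitted token is the backbone's own greedy choice, a rejected draft costs throughput but never changes the output.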

The popular mlx-community/Qwen3.6-27B-4bit checkpoint was converted with an older sanitize() that strips mtp.* weights. This repo is a fresh BF16 → 4-bit conversion done with the PR-branch sanitize(), keeping all 31 mtp.* tensors intact (mtp.fc.{weight,scales,biases} + the 1-layer transformer block).

total tensors in this checkpoint : 1878
  of which mtp.*                 :   31   (preserved)
bits per weight (effective)      : 4.501
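
One way to reproduce the tensor counts above on a downloaded copy is to read the shard index. This is a minimal sketch that assumes the checkpoint ships the usual `model.safetensors.index.json` with a `weight_map` entry; the helper names are illustrative, not part of mlx-lm.

```python
import json
from pathlib import Path


def count_mtp_tensors(weight_map):
    """Return (total, mtp) tensor counts for a safetensors weight map."""
    names = list(weight_map)
    # Match both a bare "mtp." prefix and a nested "language_model.mtp." path.
    mtp = [n for n in names if n.startswith("mtp.") or ".mtp." in n]
    return len(names), len(mtp)


def count_from_index(checkpoint_dir):
    """Read the shard index from a local checkpoint directory (assumed layout)."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    return count_mtp_tensors(index["weight_map"])
```

Calling `count_from_index` on a local snapshot of this repo should report the totals listed above if the index layout matches the assumption.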

Quick start

You need the PR-branch build of mlx-lm (until the PR lands in a release):

git clone -b feat/mtp-native https://github.com/AirRunner/mlx-lm.git
pip install -e ./mlx-lm

Then serve or generate with --mtp:

# OpenAI-compatible HTTP server
mlx_lm.server --model kradih/Qwen3.6-27B-MTP-4bit-MLX --host 127.0.0.1 --port 8081 --mtp

# One-shot generate
mlx_lm.generate --model kradih/Qwen3.6-27B-MTP-4bit-MLX --mtp \
  --prompt "Why is single-stream decode bandwidth-bound? Think step by step."

Reasoning ("thinking") is enabled by default — the Qwen3.6 chat template emits <think>...</think> content; in the streamed OpenAI response it arrives in delta.reasoning.
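
A client can separate the two streams by reading both delta fields. The sketch below assumes the server emits OpenAI-style server-sent events (`data: {...}` lines terminated by `data: [DONE]`); the `reasoning` field name is taken from the paragraph above, and `split_stream` is an illustrative helper, not an mlx-lm API.

```python
import json


def split_stream(sse_lines):
    """Accumulate reasoning and answer text from OpenAI-style SSE lines."""
    reasoning, content = [], []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            reasoning.append(delta.get("reasoning") or "")
            content.append(delta.get("content") or "")
    return "".join(reasoning), "".join(content)
```

Feeding it the decoded lines of a streamed response returns the thinking text and the final answer as two separate strings.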

The checkpoint also loads fine with stock mlx-lm releases (without --mtp); MTP weights are simply ignored and you get the same throughput as plain 4-bit MLX serving.

Performance — measured

Tested on a MacBook Pro M4 Max (16-core CPU 12P+4E / 40-core GPU / 546 GB/s unified memory / 128 GB unified), macOS 26.3.1, AC plugged in, external 5K display attached, mlx-lm @ PR #990 commit f840f6c. Methodology: single stream, concurrency = 1, fixed seed, 1024-token input → 512-token output, temperature 0.7, thinking enabled, 30 trials + 3 warmup, ignore_eos so generation length is bounded.

| Stack | Out tok/s | p50 TPOT (pure decode) | TTFT p50 | First-3-trial peak (tok/s) | Drift across 30 trials |
| --- | --- | --- | --- | --- | --- |
| mlx-community/Qwen3.6-27B-4bit (plain decode) | 20.99 | 41.8 ms (= 23.9 tok/s) | 3058 ms | 22.0 | +3.4 % ✅ |
| This repo + mlx-lm --mtp (PR #990) | 16.4 – 19.1 | 45 – 53 ms steady-state | 3700 – 4100 ms | 31 – 34 | -8 to +37 % depending on --prompt-cache-size |
| llama.cpp Q4_K_M (Metal, q8_0 KV) | 14.45 | 62.9 ms | 3260 ms | 15.9 | +6.6 % ✅ |
| Ollama qwen3.6:27b (Q4_K_M + q8_0 KV + FA) | 13.83 | 65.9 ms | 3212 ms | 14.7 | +8.9 % ✅ |
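
The columns in the table above appear to be related as sketched below. This is an inference from the published numbers, not a formula taken from the benchmark harness: pure-decode tok/s is the reciprocal of TPOT, and the end-to-end output rate also pays the TTFT once per request.

```python
def decode_tps(tpot_ms):
    """Pure-decode rate implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def end_to_end_tps(ttft_ms, tpot_ms, n_out):
    """Output tokens per second over the whole request, assuming the first
    token arrives after ttft_ms and each later token costs tpot_ms."""
    total_ms = ttft_ms + (n_out - 1) * tpot_ms
    return n_out * 1000.0 / total_ms


# Plain-decode row: 3058 ms TTFT, 41.8 ms TPOT, 512 output tokens.
print(round(decode_tps(41.8), 1))
print(round(end_to_end_tps(3058, 41.8, 512), 2))
```

Under that assumption the plain-decode and llama.cpp rows land within a few hundredths of a tok/s of the reported "out tok/s" values, which suggests the headline figure includes prefill time.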

MTP shows a 1.5× cold-start advantage (31–34 tok/s for the first 3–5 requests, matching the PR author's own M4 Pro numbers when scaled by bandwidth), but throughput degrades to baseline or below by trial 10, even with --prompt-cache-size 0. The degradation is consistent across repeated runs and is not caused by the LRU prompt cache (the cache stays at 0 sequences). The suspected cause is MLX Metal-allocator state pollution across requests; PR #990's commit ffac4333 ("clear Metal allocator cache every 256 tokens during decode") addresses the within-request case but not the inter-request one. The issue was observed on an M4 Max with a 40-core GPU and had not been triaged upstream as of 2026-05-19.

Bottom line: for genuine production single-stream serving, plain MLX 4-bit currently beats MTP on this hardware. For burst/interactive workloads where the first few responses dominate user-perceived latency, MTP shines.

Author's reference numbers from PR #990 (same model, on a smaller M4 Pro with ~273 GB/s bandwidth): baseline 15.7 → MTP 24.6 tok/s, 88 % draft acceptance at temp = 0, 80 % at temp = 1.0.

Quantization details

| Field | Value |
| --- | --- |
| Source | Qwen/Qwen3.6-27B BF16 (≈55.6 GB on disk, 15 shards) |
| Weight bits | 4 |
| Group size | 64 |
| Quantization mode | affine |
| Effective bpw | 4.501 |
| MTP head | preserved (31 language_model.mtp.* tensors retained) |
| Format | MLX safetensors (3 × ~5 GB shards) |
| On-disk size | ~14 GB |
| Conversion command | mlx_lm convert --hf-path Qwen/Qwen3.6-27B --mlx-path out --quantize --q-bits 4 --q-group-size 64 (with PR #990's branch installed) |
| Convert wall time on M4 Max | 12.9 s |
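
The effective bits-per-weight figure follows mostly from the group layout. The sketch below assumes each group of weights carries one 16-bit scale and one 16-bit bias, which is how affine group quantization is usually stored; the function name is illustrative.

```python
def affine_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight for affine group quantization: the packed weight bits
    plus the per-group scale and bias amortized over the group."""
    return bits + (scale_bits + bias_bits) / group_size


print(affine_bits_per_weight(4, 64))  # 4-bit weights, groups of 64
```

With 4-bit weights and a group size of 64 this gives 4.5 bpw; the remaining 0.001 in the reported 4.501 presumably comes from tensors left unquantized, though that attribution is an assumption.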

Context window

Native context window of the base model: 262,144 tokens (declared in config.json). An fp16 KV cache for the full window needs ~67 GB, or ~82 GB together with the weights; on a 128 GB system that leaves roughly 26 GB of headroom above the 20 GB OS reserve. Tested up to ~9k-token prompts on M4 Max with no measured throttling.

Memory footprint (measured)

| State | Resident memory |
| --- | --- |
| Server idle (weights loaded) | ~14.5 GB |
| ~1k-token prompt active | ~14.5–15 GB |
| Full 262144-token KV cache (fp16, projected) | ~82 GB total (weights + KV) |
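
For context lengths between those rows, a rough projection can be made by scaling the KV cache linearly with token count. This is a sketch under that assumption only: it reuses the ~14.5 GB weight footprint and the ~67 GB full-window KV estimate stated in this card, ignores activations and allocator overhead, and will overestimate if part of the model keeps fixed-size state.

```python
WEIGHTS_GB = 14.5        # idle footprint with weights loaded (from this card)
FULL_KV_GB = 67.0        # projected fp16 KV cache at the full window (from this card)
FULL_WINDOW = 262_144    # native context window in tokens


def projected_total_gb(context_tokens):
    """Weights plus a KV cache assumed to grow linearly with context length."""
    kv_gb = FULL_KV_GB * context_tokens / FULL_WINDOW
    return WEIGHTS_GB + kv_gb


for tokens in (1_024, 32_768, 262_144):
    print(tokens, round(projected_total_gb(tokens), 1))
```

The 1k-token projection falls inside the measured ~14.5–15 GB range, and the full-window value matches the ~82 GB row after rounding.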

Limitations / caveats

  • PR #990 is not yet merged. Until then, you need AirRunner's branch installed — stock mlx-lm releases will load this checkpoint but will not use the MTP head (you still get a clean 4-bit checkpoint with the same throughput as mlx-community/Qwen3.6-27B-4bit).
  • Cross-request degradation: see "Performance — measured" above. Open issue, not specific to this checkpoint.
  • M1 / M2 Macs: if you do not quantize mtp.fc, pass --dtype float16 to mlx_lm convert; otherwise BF16 ops on M1/M2 hit a slow CPU fallback. M3/M4 are unaffected because they have native BF16 GPU support.
  • Multimodal: the base model is image-text-to-text. This conversion preserves the vision tower; if you only need text serving, the vision weights add disk but no inference cost (loaded but unused).
  • Identity: under greedy decoding, the MTP output is bit-identical to plain generate_step (checked by PR #990's test_mtp.py). Under sampling, accepted draft tokens are drawn from the same distribution as the verify pass.
  • Single-stream only. Numbers above are concurrency = 1. Multi-stream behaviour with MTP has not been benchmarked here.

Credits

  • Qwen team — original model weights, training, and the open Apache-2.0 release.
  • Luca Vaio / @AirRunner — the MTP integration in ml-explore/mlx-lm#990, the reference implementation, and the unit tests this conversion depends on.
  • mlx-community — the standing convention for MLX-format checkpoints (mlx_lm.convert defaults, group_size 64, affine quant).
  • Apple's MLX team — the runtime.

License

Apache-2.0, inherited from the base model.
