Instructions to use kradih/Qwen3.6-27B-MTP-4bit-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use kradih/Qwen3.6-27B-MTP-4bit-MLX with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("kradih/Qwen3.6-27B-MTP-4bit-MLX") config = load_config("kradih/Qwen3.6-27B-MTP-4bit-MLX") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use kradih/Qwen3.6-27B-MTP-4bit-MLX with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "kradih/Qwen3.6-27B-MTP-4bit-MLX"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "kradih/Qwen3.6-27B-MTP-4bit-MLX" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use kradih/Qwen3.6-27B-MTP-4bit-MLX with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "kradih/Qwen3.6-27B-MTP-4bit-MLX"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default kradih/Qwen3.6-27B-MTP-4bit-MLX
Run Hermes
hermes
Qwen3.6-27B — MLX 4-bit with native Multi-Token Prediction (MTP) weights
A 4-bit MLX conversion of Qwen/Qwen3.6-27B that preserves the mtp.* head weights, so it can be served with native in-graph speculative decoding via the open pull request ml-explore/mlx-lm#990 ("Native MTP speculative decoding, Qwen3.5/3.6 reference implementation" by @AirRunner).
Why this checkpoint exists
Qwen3.5/3.6 ship with a built-in MTP head (mtp_num_hidden_layers: 1 in the config) that predicts token t+2 from the backbone hidden state at t and the embedding of token t+1. PR #990 turns those weights into a working spec-decode loop — ~1.5× single-stream peak throughput, ≥80 % draft acceptance, identical greedy output vs. plain decode.
The popular mlx-community/Qwen3.6-27B-4bit checkpoint was converted with an older sanitize() that strips mtp.* weights. This repo is a fresh BF16 → 4-bit conversion done with the PR-branch sanitize(), keeping all 31 mtp.* tensors intact (mtp.fc.{weight,scales,biases} + the 1-layer transformer block).
total tensors in this checkpoint : 1878
of which mtp.* : 31 (preserved)
bits per weight (effective) : 4.501
Quick start
You need the PR-branch build of mlx-lm (until the PR lands in a release):
git clone -b feat/mtp-native https://github.com/AirRunner/mlx-lm.git
pip install -e ./mlx-lm
Then serve or generate with --mtp:
# OpenAI-compatible HTTP server
mlx_lm.server --model kradih/Qwen3.6-27B-MTP-4bit-MLX --host 127.0.0.1 --port 8081 --mtp
# One-shot generate
mlx_lm.generate --model kradih/Qwen3.6-27B-MTP-4bit-MLX --mtp \
--prompt "Why is single-stream decode bandwidth-bound? Think step by step."
Reasoning ("thinking") is enabled by default — the Qwen3.6 chat template emits <think>...</think> content; in the streamed OpenAI response it arrives in delta.reasoning.
The checkpoint also loads fine with stock mlx-lm releases (without --mtp); MTP weights are simply ignored and you get the same throughput as plain 4-bit MLX serving.
Performance — measured
Tested on a MacBook Pro M4 Max (16-core CPU 12P+4E / 40-core GPU / 546 GB/s unified memory / 128 GB unified), macOS 26.3.1, AC plugged in, external 5K display attached, mlx-lm @ PR #990 commit f840f6c. Methodology: single stream, concurrency = 1, fixed seed, 1024-token input → 512-token output, temperature 0.7, thinking enabled, 30 trials + 3 warmup, ignore_eos so generation length is bounded.
| Stack | out tok/s p50 | TPOT (pure decode) | TTFT p50 | First-3-trial peak | Drift across 30 trials |
|---|---|---|---|---|---|
mlx-community/Qwen3.6-27B-4bit (plain decode) |
20.99 | 41.8 ms (= 23.9 tok/s) | 3058 ms | 22.0 | +3.4 % ✅ |
This repo + mlx-lm --mtp (PR #990) |
16.4 – 19.1 | 45 – 53 ms steady-state | 3700 – 4100 ms | 31 – 34 | -8 to +37 % depending on --prompt-cache-size |
| llama.cpp Q4_K_M (Metal, q8_0 KV) | 14.45 | 62.9 ms | 3260 ms | 15.9 | +6.6 % ✅ |
Ollama qwen3.6:27b (Q4_K_M + q8_0 KV + FA) |
13.83 | 65.9 ms | 3212 ms | 14.7 | +8.9 % ✅ |
MTP shows a 1.5× cold-start advantage (31–34 tok/s for the first 3–5 requests, matching the PR author's own M4 Pro numbers when scaled by bandwidth) but degrades to baseline-or-below by trial 10, even with --prompt-cache-size 0. This is consistent across repeated runs and is not the LRU prompt cache (the cache stays at 0 sequences). Suspected cause is MLX Metal-allocator state pollution across requests; PR #990's commit ffac4333 ("clear Metal allocator cache every 256 tokens during decode") addresses the within-request case but not the inter-request one. Issue surfaced on M4 Max 40-core GPU — has not been triaged upstream as of 2026-05-19.
Bottom line: for genuine production single-stream serving, plain MLX 4-bit currently beats MTP on this hardware. For burst/interactive workloads where the first few responses dominate user-perceived latency, MTP shines.
Author's reference numbers from PR #990 (same model, on a smaller M4 Pro with ~273 GB/s bandwidth): baseline 15.7 → MTP 24.6 tok/s, 88 % draft acceptance at temp = 0, 80 % at temp = 1.0.
Quantization details
| Field | Value |
|---|---|
| Source | Qwen/Qwen3.6-27B BF16 (≈55.6 GB on disk, 15 shards) |
| Weight bits | 4 |
| Group size | 64 |
| Quantization mode | affine |
| Effective bpw | 4.501 |
| MTP head | preserved (31 language_model.mtp.* tensors retained) |
| Format | MLX safetensors (3 × ~5 GB shards) |
| On-disk size | ~14 GB |
| Conversion command | mlx_lm convert --hf-path Qwen/Qwen3.6-27B --mlx-path out --quantize --q-bits 4 --q-group-size 64 (with PR #990's branch installed) |
| Convert wall time on M4 Max | 12.9 s |
Context window
Native context window of the base model: 262,144 tokens (declared in config.json). Memory required for the full window at fp16 KV cache is ~67 GB; on 128 GB systems that leaves >40 GB headroom above the 20 GB OS reserve. Tested up to ~9k-token prompts on M4 Max with no measured throttling.
Memory footprint (measured)
| State | Resident memory |
|---|---|
| Server idle (weights loaded) | ~14.5 GB |
| ~1k-token prompt active | ~14.5–15 GB |
| Full 262144-token KV cache (fp16, projected) | ~82 GB total (weights + KV) |
Limitations / caveats
- PR #990 is not yet merged. Until then, you need AirRunner's branch installed — stock
mlx-lmreleases will load this checkpoint but will not use the MTP head (you still get a clean 4-bit checkpoint with the same throughput asmlx-community/Qwen3.6-27B-4bit). - Cross-request degradation: see "Performance — measured" above. Open issue, not specific to this checkpoint.
- M1 / M2 Macs: if you do not quantise
mtp.fc, pass--dtype float16tomlx_lm convert; otherwise BF16 ops on M1/M2 hit a slow CPU fallback. M3/M4 are unaffected — they have native BF16 GPU support. - Multimodal: the base model is image-text-to-text. This conversion preserves the vision tower; if you only need text serving, the vision weights add disk but no inference cost (loaded but unused).
- Identity: under greedy decoding the MTP output is bit-exact equal to plain
generate_step(PR #990'stest_mtp.py). Under sampling, accepted draft tokens come from the same shared distribution as the verify pass. - Single-stream only. Numbers above are concurrency = 1. Multi-stream behaviour with MTP has not been benchmarked here.
Credits
- Qwen team — original model weights, training, and the open Apache-2.0 release.
- Luca Vaio / @AirRunner — the MTP integration in ml-explore/mlx-lm#990, the reference implementation, and the unit tests this conversion depends on.
mlx-community— the standing convention for MLX-format checkpoints (mlx_lm.convertdefaults, group_size 64, affine quant).- Apple's MLX team — the runtime.
License
Apache-2.0, inherited from the base model.
- Downloads last month
- 1,774
4-bit
Model tree for kradih/Qwen3.6-27B-MTP-4bit-MLX
Base model
Qwen/Qwen3.6-27B