# Qwen3.6-27B-MTP Hybrid IQ4_KS GGUF
ik_llama.cpp is required to run this model.
This model was quantized from a Q8_0 intermediate rather than directly from FP16, so some accuracy may have been lost in that step.
This is a "Hybrid" IQ4_KS GGUF quantization of Qwen3.6-27B that preserves the MTP (Multi-Token Prediction) layers, allowing for significantly faster text generation via speculative decoding.
Standard GGUF conversions often strip out MTP tensors to save a tiny bit of space. This model was carefully requantized from Radamanthys11/Qwen3.6-27B-MTP-Q8_0-GGUF using ik_llama.cpp to retain the MTP head while shrinking the VRAM requirements down to a highly efficient Q4 footprint.
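If you want to verify that a GGUF actually still carries the MTP head, you can list its tensors. A minimal sketch, assuming the `gguf` Python package (from llama.cpp's gguf-py) is installed and that the MTP tensors have an identifiable name containing `mtp` or `nextn` (the exact naming depends on the architecture):

```bash
# Sketch: dump tensor names and grep for MTP-related entries.
# Assumes `pip install gguf`; tensor naming varies by architecture,
# so grep broadly for "mtp" or "nextn".
gguf-dump ./Qwen3.6-27B-MTP-IQ4_KS.gguf | grep -iE 'mtp|nextn'
```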
## Inference Speed (MTP vs Baseline)
Tested using llama-server on an RTX 3090 to measure generation speed. Enabling Multi-Token Prediction with a single draft token yields roughly a 16% speedup over standard inference.
| Configuration | Speed (t/s) | Setup / Flags |
|---|---|---|
| MTP Max 1 | 28.99 | `-mtp --draft-max 1 --draft-p-min 0.0` |
| Baseline (No MTP) | 24.99 | (no MTP flags) |
| MTP Max 2 | 24.84 | `-mtp --draft-max 2 --draft-p-min 0.0` |
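To reproduce a comparison yourself, start the server once with and once without the MTP flags and read the generation speed from the response timings. A minimal sketch, assuming the standard llama-server `/completion` response format and the default `127.0.0.1:8080` endpoint:

```bash
# Sketch: query a running llama-server and print the generation speed.
# Assumes the default host/port and the standard "timings" field
# in the /completion response.
curl -s http://127.0.0.1:8080/completion \
  -d '{"prompt": "Write a short story about a robot.", "n_predict": 256}' |
  python3 -c 'import json,sys; print(json.load(sys.stdin)["timings"]["predicted_per_second"], "t/s")'
```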
## Perplexity
Measured against wiki.test.raw with n_ctx=512 over 580 chunks. Tests confirm that enabling MTP decoding does not negatively impact the perplexity score:
| Quant | Size | Without MTP PPL | With MTP PPL |
|---|---|---|---|
| Hybrid IQ4_KS | 16.8 GiB | 6.9424 +/- 0.04574 | 6.9424 +/- 0.04574 |
| Q4_K_M | 15.7 GiB | 7.0291 +/- 0.04648 | 7.0291 +/- 0.04648 |
Note: Lower is better.
## Quantization Recipe
The custom ruleset used for the mixed-precision tensor overrides. Rules are matched in order, so the specific patterns come before the generic fallbacks:
custom="
# SSM State Logic
blk\..*\.ssm_alpha\.weight=f32
blk\..*\.ssm_beta\.weight=f32
blk\..*\.ssm_out\.weight=q8_0
# 1. Non-linear mapping strictly for attention
blk\..*\.attn_.*\.weight=iq4_nl
# 2. Sandwich boost (First 8 / Last 8) -> iq5_ks for ALL FFN tensors
blk\.[0-7]\.ffn_.*\.weight=iq5_ks
blk\.(5[6-9]|6[0-3])\.ffn_.*\.weight=iq5_ks
# 3. Global bottleneck boost -> iq5_ks for remaining ffn_down
blk\..*\.ffn_down\.weight=iq5_ks
# 4. Fallback -> iq4_ks for remaining ffn_gate / ffn_up
blk\..*\.ffn_.*\.weight=iq4_ks
# 5. High precision anchors
token_embd\.weight=q8_0
output\.weight=q8_0
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
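The shell post-processing strips the comment lines and joins the remaining rules into the single comma-separated string that `--custom-q` expects. A quick sanity check:

```bash
# After the transformation, $custom should be one comma-separated rule list:
echo "$custom"
# blk\..*\.ssm_alpha\.weight=f32,blk\..*\.ssm_beta\.weight=f32,...,output\.weight=q8_0
```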
## How This Was Made (Reproduction Steps)
This model was re-quantized directly from a Q8_0 intermediate model using `--allow-requantize`.
### 1. Generate the Imatrix from Q8_0
Note: `GGML_CUDA_NO_PINNED=1` is used to prevent system RAM exhaustion on 24 GB VRAM setups.
```bash
GGML_CUDA_NO_PINNED=1 ./ik_llama.cpp/build/bin/llama-imatrix \
  -m ./Qwen3.6-27B-MTP-Q8_0.gguf \
  -f /path/to/ubergarm-imatrix-calibration-corpus-v02.txt \
  -o Qwen3.6-27B-MTP-imatrix.dat \
  --ctx-size 512 \
  -t 16 \
  --fit
```
### 2. Requantize Q8_0 directly to Hybrid IQ4_KS
```bash
./ik_llama.cpp/build/bin/llama-quantize \
  --allow-requantize \
  --imatrix ./Qwen3.6-27B-MTP-imatrix.dat \
  --custom-q "$custom" \
  ./Qwen3.6-27B-MTP-Q8_0.gguf \
  ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  IQ4_KS 16  # base quant type, 16 threads
```
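As a quick check on the output, the file size should land near the 16.8 GiB listed in the perplexity table above:

```bash
# Sanity check: the hybrid IQ4_KS file should be roughly 16.8 GiB.
ls -lh ./Qwen3.6-27B-MTP-IQ4_KS.gguf
```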
### 3. Test Perplexity
```bash
wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz

# Standard perplexity
./ik_llama.cpp/build/bin/llama-perplexity \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -f ./wiki.test.raw \
  -c 512 \
  -ngl 99

# Perplexity with MTP enabled
./ik_llama.cpp/build/bin/llama-perplexity \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -f ./wiki.test.raw \
  -c 512 \
  -ngl 99 \
  -mtp --draft-max 1 --draft-p-min 0.0
```
## Quick Start Inference
Requires ik_llama.cpp. Be sure to pass the MTP flags (`-mtp` and `--draft-max 1`) to utilize the Multi-Token Prediction speedup!
```bash
./ik_llama.cpp/build/bin/llama-server \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -c 10000 \
  -ngl 99 \
  -mtp --draft-max 1 --draft-p-min 0.0
```
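Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch, assuming the default `127.0.0.1:8080` endpoint:

```bash
# Sketch: send a chat request to the running server
# (assumes the default host/port and the OpenAI-compatible endpoint).
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'
```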
Model tree for RDson/Qwen3.6-27B-MTP-IQ4_KS-GGUF:
- Base model: Qwen/Qwen3.6-27B