Qwen3.6-27B-MTP Hybrid IQ4_KS GGUF

ik_llama.cpp is required to run this model.

This was quantized from a Q8_0 intermediate rather than directly from FP16, so some accuracy may have been lost in the process.

This is a "Hybrid" IQ4_KS GGUF quantization of Qwen3.6-27B that preserves the MTP (Multi-Token Prediction) layers, allowing for significantly faster text generation via speculative decoding.

Standard GGUF conversions often strip out the MTP tensors to save a small amount of space. This model was requantized from Radamanthys11/Qwen3.6-27B-MTP-Q8_0-GGUF using ik_llama.cpp to retain the MTP head while shrinking the footprint to roughly 17 GiB (Q4-class).
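
To confirm that the MTP tensors are actually present in a downloaded file, one quick check is to list the tensor names. A minimal sketch, assuming the gguf Python package is installed and assuming the MTP tensors carry "mtp" somewhere in their names (the exact naming may differ):

# Sketch: list tensor names and look for MTP-related entries.
# gguf-dump comes from `pip install gguf`; the "mtp" substring is an assumption.
pip install gguf
gguf-dump ./Qwen3.6-27B-MTP-IQ4_KS.gguf | grep -i mtp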

Inference Speed (MTP vs Baseline)

Tested using llama-server on an RTX 3090 to measure token generation speed. Enabling Multi-Token Prediction with a single draft token yields roughly a 16% speedup over standard inference.

| Configuration | Speed | Setup / Flags |
| --- | --- | --- |
| MTP Max 1 | 28.99 t/s | -mtp --draft-max 1 --draft-p-min 0.0 |
| Baseline (No MTP) | 24.99 t/s | (No MTP flags) |
| MTP Max 2 | 24.84 t/s | -mtp --draft-max 2 --draft-p-min 0.0 |
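
To reproduce these numbers, one option (a sketch, not necessarily the exact harness used above) is to start llama-server with and without the MTP flags and read the per-request timings it returns. This assumes ik_llama.cpp's server keeps the upstream llama.cpp /completion endpoint and its timings field, and that it is listening on the default port 8080:

# Sketch: request a fixed number of tokens and inspect the reported timings
# (tokens/second appear in the "timings" object of the JSON response).
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about a lighthouse.", "n_predict": 256}' \
  | grep -o '"timings": *{[^}]*}'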

Perplexity

Measured against wiki.test.raw with n_ctx=512 over 580 chunks. Tests confirm that enabling MTP decoding does not negatively impact the perplexity score:

| Quant | Size | Without MTP PPL | With MTP PPL |
| --- | --- | --- | --- |
| Hybrid IQ4_KS | 16.8 GiB | 6.9424 +/- 0.04574 | 6.9424 +/- 0.04574 |
| Q4_K_M | 15.7 GiB | 7.0291 +/- 0.04648 | 7.0291 +/- 0.04648 |

Note: Lower is better.

Quantization Recipe

The custom ruleset used for the mixed-precision tensor overrides:

custom="
# SSM State Logic
blk\..*\.ssm_alpha\.weight=f32
blk\..*\.ssm_beta\.weight=f32
blk\..*\.ssm_out\.weight=q8_0

# 1. Non-linear mapping strictly for attention
blk\..*\.attn_.*\.weight=iq4_nl

# 2. Sandwich boost (First 8 / Last 8) -> iq5_ks for ALL FFN tensors
blk\.[0-7]\.ffn_.*\.weight=iq5_ks
blk\.(5[6-9]|6[0-3])\.ffn_.*\.weight=iq5_ks

# 3. Global bottleneck boost -> iq5_ks for remaining ffn_down
blk\..*\.ffn_down\.weight=iq5_ks

# 4. Fallback -> iq4_ks for remaining ffn_gate / ffn_up
blk\..*\.ffn_.*\.weight=iq4_ks

# 5. High precision anchors
token_embd\.weight=q8_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
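
As a quick sanity check of the ruleset (not part of the original recipe), the loop below tests a few hand-picked tensor names against the regexes, assuming the overrides are applied top to bottom with the first match winning, which is what the ordering above relies on:

# Sketch: report the first rule each illustrative tensor name would hit.
# Run after the $custom processing above; the tensor names are examples only.
for tensor in blk.0.ffn_gate.weight blk.30.ffn_down.weight blk.30.ffn_up.weight blk.12.attn_q.weight; do
  echo "$custom" | tr ',' '\n' | while IFS='=' read -r re q; do
    if echo "$tensor" | grep -Eq "^${re}$"; then
      echo "$tensor -> $q (rule: $re)"
      break
    fi
  done
done

With the recipe above this should report iq5_ks for the first two names (sandwich boost and ffn_down boost), iq4_ks for blk.30.ffn_up.weight (fallback), and iq4_nl for the attention tensor.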

How This Was Made (Reproduction Steps)

This model was re-quantized directly from a Q8_0 intermediate model using --allow-requantize.
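
If ik_llama.cpp is not built yet, a typical CUDA build looks roughly like this (a sketch; the -DGGML_CUDA=ON flag follows the upstream llama.cpp convention, so check the ik_llama.cpp README for the exact options for your setup):

# Sketch: clone and build ik_llama.cpp with CUDA support.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j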

1. Generate the Imatrix from Q8_0

Note: GGML_CUDA_NO_PINNED=1 is used to prevent system RAM exhaustion on 24 GB VRAM setups.

GGML_CUDA_NO_PINNED=1 ./ik_llama.cpp/build/bin/llama-imatrix \
  -m ./Qwen3.6-27B-MTP-Q8_0.gguf \
  -f /path/to/ubergarm-imatrix-calibration-corpus-v02.txt \
  -o Qwen3.6-27B-MTP-imatrix.dat \
  --ctx-size 512 \
  -t 16 \
  --fit

2. Requantize Q8_0 directly to Hybrid IQ4_KS

./ik_llama.cpp/build/bin/llama-quantize \
  --allow-requantize \
  --imatrix ./Qwen3.6-27B-MTP-imatrix.dat \
  --custom-q "$custom" \
  ./Qwen3.6-27B-MTP-Q8_0.gguf \
  ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  IQ4_KS 16
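
Before committing to the long perplexity runs, it can be worth confirming the overrides were applied. One option (a sketch; the log wording and the 60-second timeout are assumptions) is to briefly load the result and grep the loader output, which prints a per-type tensor summary on startup:

# Sketch: load the model for up to 60 seconds and show the tensor-type
# breakdown from the loader log (lines like "llama_model_loader: - type iq5_ks: ... tensors").
timeout 60 ./ik_llama.cpp/build/bin/llama-server \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf -ngl 99 2>&1 | grep -- '- type '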

3. Test Perplexity

wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz

# Standard Perplexity
./ik_llama.cpp/build/bin/llama-perplexity \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -f ./wiki.test.raw \
  -c 512 \
  -ngl 99

# Perplexity with MTP enabled
./ik_llama.cpp/build/bin/llama-perplexity \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -f ./wiki.test.raw \
  -c 512 \
  -ngl 99 \
  -mtp --draft-max 1 --draft-p-min 0.0

Quick Start Inference

Requires ik_llama.cpp. Be sure to pass the MTP flags (-mtp and --draft-max 1) to utilize the Multi-Token Prediction speedups!

./ik_llama.cpp/build/bin/llama-server \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -c 10000 \
  -ngl 99 \
  -mtp --draft-max 1 --draft-p-min 0.0
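
Once the server is up, it can be queried over HTTP. A minimal sketch assuming ik_llama.cpp keeps the upstream llama.cpp OpenAI-compatible endpoint and the default port 8080:

# Sketch: simple chat request against the running server.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain speculative decoding in one paragraph."}
    ],
    "max_tokens": 256
  }'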