Qwen3.6-27B-MTP Hybrid IQ4_KS GGUF

ik_llama.cpp is required to run this model.

This was quantized from a Q8_0 intermediate rather than directly from FP16, so some accuracy may have been lost in the process.

This is a "Hybrid" IQ4_KS GGUF quantization of Qwen3.6-27B that preserves the MTP (Multi-Token Prediction) layers, allowing for significantly faster text generation via speculative decoding.

Standard GGUF conversions often strip out the MTP tensors to save a small amount of space. This model was requantized from Radamanthys11/Qwen3.6-27B-MTP-Q8_0-GGUF using ik_llama.cpp to retain the MTP head while shrinking the footprint to roughly 17 GiB (Q4-class).
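
To confirm that the MTP tensors are actually present in a downloaded file, one quick check is to list the tensor names. A minimal sketch, assuming the gguf Python package is installed and assuming the MTP tensors carry "mtp" somewhere in their names (the exact naming may differ):

# Sketch: list tensor names and look for MTP-related entries.
# gguf-dump comes from `pip install gguf`; the "mtp" substring is an assumption.
pip install gguf
gguf-dump ./Qwen3.6-27B-MTP-IQ4_KS.gguf | grep -i mtp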

Inference Speed (MTP vs Baseline)

Tested using llama-server on an RTX 3090 to measure token generation speed. Enabling Multi-Token Prediction with a single draft token yields roughly a 16% speedup over standard inference.

| Configuration | Speed | Setup / Flags |
| --- | --- | --- |
| MTP Max 1 | 28.99 t/s | -mtp --draft-max 1 --draft-p-min 0.0 |
| Baseline (No MTP) | 24.99 t/s | (No MTP flags) |
| MTP Max 2 | 24.84 t/s | -mtp --draft-max 2 --draft-p-min 0.0 |
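
To reproduce these numbers, one option (a sketch, not necessarily the exact harness used above) is to start llama-server with and without the MTP flags and read the per-request timings it returns. This assumes ik_llama.cpp's server keeps the upstream llama.cpp /completion endpoint and its timings field, and that it is listening on the default port 8080:

# Sketch: request a fixed number of tokens and inspect the reported timings
# (tokens/second appear in the "timings" object of the JSON response).
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about a lighthouse.", "n_predict": 256}' \
  | grep -o '"timings": *{[^}]*}'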

Perplexity

Measured against wiki.test.raw with n_ctx=512 over 580 chunks. Tests confirm that enabling MTP decoding does not negatively impact the perplexity score:

| Quant | Size | Without MTP PPL | With MTP PPL |
| --- | --- | --- | --- |
| Hybrid IQ4_KS | 16.8 GiB | 6.9424 +/- 0.04574 | 6.9424 +/- 0.04574 |
| Q4_K_M | 15.7 GiB | 7.0291 +/- 0.04648 | 7.0291 +/- 0.04648 |

Note: Lower is better.

Quantization Recipe

The custom ruleset used for the mixed-precision tensor overrides:

custom="
# SSM State Logic
blk\..*\.ssm_alpha\.weight=f32
blk\..*\.ssm_beta\.weight=f32
blk\..*\.ssm_out\.weight=q8_0

# 1. Non-linear mapping strictly for attention
blk\..*\.attn_.*\.weight=iq4_nl

# 2. Sandwich boost (First 8 / Last 8) -> iq5_ks for ALL FFN tensors
blk\.[0-7]\.ffn_.*\.weight=iq5_ks
blk\.(5[6-9]|6[0-3])\.ffn_.*\.weight=iq5_ks

# 3. Global bottleneck boost -> iq5_ks for remaining ffn_down
blk\..*\.ffn_down\.weight=iq5_ks

# 4. Fallback -> iq4_ks for remaining ffn_gate / ffn_up
blk\..*\.ffn_.*\.weight=iq4_ks

# 5. High precision anchors
token_embd\.weight=q8_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
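
As a quick sanity check of the ruleset (not part of the original recipe), the loop below tests a few hand-picked tensor names against the regexes, assuming the overrides are applied top to bottom with the first match winning, which is what the ordering above relies on:

# Sketch: report the first rule each illustrative tensor name would hit.
# Run after the $custom processing above; the tensor names are examples only.
for tensor in blk.0.ffn_gate.weight blk.30.ffn_down.weight blk.30.ffn_up.weight blk.12.attn_q.weight; do
  echo "$custom" | tr ',' '\n' | while IFS='=' read -r re q; do
    if echo "$tensor" | grep -Eq "^${re}$"; then
      echo "$tensor -> $q (rule: $re)"
      break
    fi
  done
done

With the recipe above this should report iq5_ks for the first two names (sandwich boost and ffn_down boost), iq4_ks for blk.30.ffn_up.weight (fallback), and iq4_nl for the attention tensor.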

How This Was Made (Reproduction Steps)

This model was re-quantized directly from a Q8_0 intermediate model using --allow-requantize.
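
If ik_llama.cpp is not built yet, a typical CUDA build looks roughly like this (a sketch; the -DGGML_CUDA=ON flag follows the upstream llama.cpp convention, so check the ik_llama.cpp README for the exact options for your setup):

# Sketch: clone and build ik_llama.cpp with CUDA support.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j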

1. Generate the Imatrix from Q8_0

Note: GGML_CUDA_NO_PINNED=1 is used to prevent system RAM exhaustion on 24 GB VRAM setups.

GGML_CUDA_NO_PINNED=1 ./ik_llama.cpp/build/bin/llama-imatrix \
  -m ./Qwen3.6-27B-MTP-Q8_0.gguf \
  -f /path/to/ubergarm-imatrix-calibration-corpus-v02.txt \
  -o Qwen3.6-27B-MTP-imatrix.dat \
  --ctx-size 512 \
  -t 16 \
  --fit

2. Requantize Q8_0 directly to Hybrid IQ4_KS

./ik_llama.cpp/build/bin/llama-quantize \
  --allow-requantize \
  --imatrix ./Qwen3.6-27B-MTP-imatrix.dat \
  --custom-q "$custom" \
  ./Qwen3.6-27B-MTP-Q8_0.gguf \
  ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  IQ4_KS 16
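
Before committing to the long perplexity runs, it can be worth confirming the overrides were applied. One option (a sketch; the log wording and the 60-second timeout are assumptions) is to briefly load the result and grep the loader output, which prints a per-type tensor summary on startup:

# Sketch: load the model for up to 60 seconds and show the tensor-type
# breakdown from the loader log (lines like "llama_model_loader: - type iq5_ks: ... tensors").
timeout 60 ./ik_llama.cpp/build/bin/llama-server \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf -ngl 99 2>&1 | grep -- '- type '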

3. Test Perplexity

wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz

# Standard Perplexity
./ik_llama.cpp/build/bin/llama-perplexity \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -f ./wiki.test.raw \
  -c 512 \
  -ngl 99

# Perplexity with MTP enabled
./ik_llama.cpp/build/bin/llama-perplexity \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -f ./wiki.test.raw \
  -c 512 \
  -ngl 99 \
  -mtp --draft-max 1 --draft-p-min 0.0

Quick Start Inference

Requires ik_llama.cpp. Be sure to pass the MTP flags (-mtp and --draft-max 1) to utilize the Multi-Token Prediction speedups!

./ik_llama.cpp/build/bin/llama-server \
  -m ./Qwen3.6-27B-MTP-IQ4_KS.gguf \
  -c 10000 \
  -ngl 99 \
  -mtp --draft-max 1 --draft-p-min 0.0
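
Once the server is up, it can be queried over HTTP. A minimal sketch assuming ik_llama.cpp keeps the upstream llama.cpp OpenAI-compatible endpoint and the default port 8080:

# Sketch: simple chat request against the running server.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain speculative decoding in one paragraph."}
    ],
    "max_tokens": 256
  }'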