Qwen3.6-27B-DFlash GGUF Quantizations
This repository stages GGUF quantizations of the DFlash drafter for Qwen/Qwen3.6-27B.
These files are intended for DFlash speculative decoding with llama.cpp builds that support the dflash-draft architecture, especially spiritbuun/buun-llama-cpp.
This is not an official z-lab or Qwen release. The source drafter is z-lab/Qwen3.6-27B-DFlash, which is licensed MIT and currently notes that Qwen3.6 DFlash support is still maturing because of architecture changes including causal SWA layers. The target model is Qwen/Qwen3.6-27B, licensed Apache 2.0.
Recommended files
Based on our RTX 3090 llama.cpp/buun tests with a multimodal projector loaded:
- Best all-around Q4_K_M target setup: Qwen3.6-27B-DFlash-IQ4_XS.gguf
- Strong long-code challenger: Qwen3.6-27B-DFlash-Q4_K_M.gguf
- Avoid for this specific 3090/mmproj-loaded setup unless you have a reason: Qwen3.6-27B-DFlash-Q8_0.gguf
Q8_0 had competitive short-code acceptance, but its extra VRAM cost did not translate into better end-to-end throughput in our llama.cpp server path.
F16 and Q6_K are included for completeness. The benchmark tables below focus on the draft quants we actually tested on the RTX 3090 server path; Q6_K has not yet been benchmarked in that setup.
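If you only want the recommended drafter, a minimal download sketch (assuming the `huggingface-cli` tool and this repo id; adjust the filename and local directory to taste):

```bash
# Sketch: pull only the recommended IQ4_XS drafter into ./models (filenames are listed in the table below)
huggingface-cli download Ardenzard/Qwen3.6-27B-DFlash-GGUF \
  Qwen3.6-27B-DFlash-IQ4_XS.gguf --local-dir ./models
```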
Files
| File | Size | SHA256 prefix |
|---|---|---|
| Qwen3.6-27B-DFlash-F16.gguf | 3310.7 MiB | f4edf54b8f83... |
| Qwen3.6-27B-DFlash-IQ4_NL.gguf | 941.9 MiB | 0a8d892acea4... |
| Qwen3.6-27B-DFlash-IQ4_XS.gguf | 891.1 MiB | 7807d36837ba... |
| Qwen3.6-27B-DFlash-Q4_K_M.gguf | 985.2 MiB | 254d1805b6b6... |
| Qwen3.6-27B-DFlash-Q5_K_M.gguf | 1169.0 MiB | 8c420ebe6d20... |
| Qwen3.6-27B-DFlash-Q6_K.gguf | 1364.2 MiB | cf3b3b5629a2... |
| Qwen3.6-27B-DFlash-Q8_0.gguf | 1763.8 MiB | 6799ace6a821... |
A full checksum manifest is included in SHA256SUMS.txt.
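To verify downloads before use, a minimal check (assuming the manifest uses the standard `sha256sum` format and sits next to the GGUFs):

```bash
# Verify every downloaded GGUF against the bundled manifest
sha256sum -c SHA256SUMS.txt

# Or hash a single file and compare its prefix against the table above
sha256sum Qwen3.6-27B-DFlash-IQ4_XS.gguf
```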
Benchmark context
These results are deployment-focused, not a universal model-quality benchmark.
Hardware and runtime:
- GPU: single RTX 3090, 24 GiB VRAM
- CPU/RAM: Kubernetes pod with 4 CPU cores and 32 GiB RAM for GPU-serving tests
- Runtime: patched llama-server from spiritbuun/buun-llama-cpp
- Target model: Qwen3.6 27B GGUF, primarily Q4_K_M
- Multimodal projector: loaded during benchmark, quantized Q6_K
- KV cache: turbo4 / turbo4
- Context: -c 131072, -np 1
- DFlash profile: -cd 256, --draft-max 8, --draft-min 1, --draft-p-min 0.75
- Sampling: --temp 0.2
Important caveat: the loaded mmproj makes these numbers more conservative, and more representative of a real multimodal deployment, than a text-only benchmark would be. If you run text-only, or with a different CPU allocation, target quant, KV quant, or DFlash implementation, your speedups may differ.
Multimodal + DFlash compatibility note
These benchmarks were run with a small local compatibility patch applied on top of buun's fork so that a multimodal projector can stay loaded while DFlash speculative decoding remains active for text generation. Some llama.cpp/buun builds disable speculative decoding as soon as an mmproj is present; if your log contains:
`speculative decoding is not supported by multimodal, it will be disabled`
then DFlash is not actually active and you should not expect to reproduce the results here. Until this behavior is supported directly by the fork you are using, you need an equivalent server-side patch or you should run without --mmproj.
The benchmark prompts themselves were text-only; the mmproj was loaded to match the intended real deployment's VRAM/runtime shape. Image-containing turns may need separate validation depending on your harness and llama.cpp build.
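As a quick way to confirm DFlash stayed active in your own runs, you can grep the server output for that disable message; a minimal sketch, assuming you capture llama-server output to a hypothetical `server.log`:

```bash
# server.log is a placeholder for wherever you capture llama-server output
if grep -q "speculative decoding is not supported by multimodal" server.log; then
  echo "DFlash was disabled by the mmproj path; these benchmark numbers will not reproduce"
else
  echo "No disable message found; DFlash speculative decoding should be active"
fi
```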
Summary results
Baseline is Qwen3.6-27B Q4_K_M target, Q6_K mmproj loaded, turbo4/turbo4, no drafter.
| Configuration | Avg tok/s | Delta vs Q4 plain | Avg acceptance |
|---|---|---|---|
| Q4_K_M target plain baseline | 36.16 | +0.0% | n/a |
| Q4_K_M target + local IQ4_XS draft | 40.79 | +12.8% | 32.31% |
| Q4_K_M target + local Q4_K_M draft | 40.07 | +10.8% | 32.44% |
| Q4_K_M target + published Q8_0 draft | 34.08 | -5.8% | n/a |
| IQ4_XS target + local IQ4_XS draft | 45.31 | +25.3% | 34.08% |
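The delta column is simply the relative tok/s gain over the plain Q4_K_M baseline; for example, the IQ4_XS-draft summary row works out as:

```bash
# (draft tok/s / baseline tok/s - 1) * 100
awk 'BEGIN { printf "%.1f%%\n", (40.79 / 36.16 - 1) * 100 }'   # -> 12.8%
```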
Per-case results
Delta is relative to the matching Q4_K_M plain baseline for that case.
| Case | Configuration | Tok/s | Delta vs plain | Acceptance |
|---|---|---|---|---|
| Short code regex debug | Q4_K_M target + local IQ4_XS draft | 48.57 | +31.9% | 40.18% |
| Short code regex debug | Q4_K_M target + local Q4_K_M draft | 45.48 | +23.6% | 37.68% |
| Short code regex debug | Q4_K_M target + published Q8_0 draft | 38.45 | +4.5% | n/a |
| Multi-turn install + credentials | Q4_K_M target + local IQ4_XS draft | 37.97 | +5.1% | 28.93% |
| Multi-turn install + credentials | Q4_K_M target + local Q4_K_M draft | 36.74 | +1.7% | 28.29% |
| Multi-turn install + credentials | Q4_K_M target + published Q8_0 draft | 35.80 | -0.9% | n/a |
| Long mixed shipping-code task | Q4_K_M target + local IQ4_XS draft | 38.95 | +7.0% | 29.36% |
| Long mixed shipping-code task | Q4_K_M target + local Q4_K_M draft | 42.71 | +17.3% | 34.46% |
| Long mixed shipping-code task | Q4_K_M target + published Q8_0 draft | 34.48 | -5.3% | n/a |
| Tool-heavy follow-up synthesis | Q4_K_M target + local IQ4_XS draft | 37.67 | +6.7% | 30.77% |
| Tool-heavy follow-up synthesis | Q4_K_M target + local Q4_K_M draft | 35.36 | +0.2% | 29.33% |
| Tool-heavy follow-up synthesis | Q4_K_M target + published Q8_0 draft | 27.58 | -21.8% | n/a |
The full plotted benchmark CSVs and chart used for this model card are available in the local experiment workspace. If desired, they can be uploaded alongside the GGUFs as reproducibility artifacts.
Example llama-server command
```bash
llama-server \
  -m /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-Qwen3.6-27B-Q6_K.gguf \
  -md /path/to/Qwen3.6-27B-DFlash-IQ4_XS.gguf \
  --spec-type dflash \
  -c 131072 -np 1 -ngl 999 \
  -b 256 -ub 64 \
  -cd 256 -ngld all -ctkd f16 -ctvd f16 \
  --draft-max 8 --draft-min 1 --draft-p-min 0.75 \
  -fa on --cache-type-k turbo4 --cache-type-v turbo4
```
The exact DFlash flag surface is fork/build-dependent. These GGUFs were produced and tested with buun's llama.cpp fork.
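Once the server is up, requests go through the usual llama-server HTTP API; a minimal text-only smoke test, assuming the default 127.0.0.1:8080 bind (adjust if you pass --host/--port):

```bash
# OpenAI-compatible chat endpoint; the drafter is used transparently on the server side
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a regex that matches ISO 8601 dates."}],
        "temperature": 0.2
      }'
```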
Conversion notes
The GGUFs were created from the official z-lab safetensors source using buun's convert_hf_to_gguf.py and llama-quantize. Tokenizer/config files from the target Qwen3.6 model repo were copied into the conversion source before GGUF export, matching the conversion path used in our tests.
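For reference, the overall shape of that path looks roughly like the sketch below; the paths, output names, and chosen quant are illustrative, and both tools must come from buun's fork rather than upstream llama.cpp:

```bash
# Illustrative sketch only; paths, file names, and the quant type are assumptions
cp /path/to/Qwen3.6-27B/tokenizer* /path/to/Qwen3.6-27B-DFlash/   # copy tokenizer/config from the target repo
python convert_hf_to_gguf.py /path/to/Qwen3.6-27B-DFlash \
  --outtype f16 --outfile Qwen3.6-27B-DFlash-F16.gguf
llama-quantize Qwen3.6-27B-DFlash-F16.gguf Qwen3.6-27B-DFlash-IQ4_XS.gguf IQ4_XS
```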
References and credit
- Source DFlash drafter: z-lab/Qwen3.6-27B-DFlash
- Target model: Qwen/Qwen3.6-27B
- DFlash paper: DFlash: Block Diffusion for Flash Speculative Decoding
- llama.cpp fork used for conversion/testing: spiritbuun/buun-llama-cpp
- Upstream llama.cpp project: ggml-org/llama.cpp
If you use these files, please cite the DFlash authors and follow the licenses/terms of the source drafter and target model.
DFlash citation
```bibtex
@article{chen2026dflash,
  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
```