Use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Ardenzard/Qwen3.6-27B-DFlash-GGUF",
	filename="Qwen3.6-27B-DFlash-IQ4_XS.gguf",  # or any file from the table below
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Qwen3.6-27B-DFlash GGUF Quantizations

This repository stages GGUF quantizations of the DFlash drafter for Qwen/Qwen3.6-27B. These files are intended for DFlash speculative decoding with llama.cpp builds that support the dflash-draft architecture, especially spiritbuun/buun-llama-cpp.

This is not an official z-lab or Qwen release. The source drafter is z-lab/Qwen3.6-27B-DFlash, which is licensed MIT and currently notes that Qwen3.6 DFlash support is still maturing because of architecture changes including causal SWA layers. The target model is Qwen/Qwen3.6-27B, licensed Apache 2.0.

Recommended files

Based on our RTX 3090 llama.cpp/buun tests with a multimodal projector loaded:

  • Best all-around Q4_K_M target setup: Qwen3.6-27B-DFlash-IQ4_XS.gguf
  • Strong long-code challenger: Qwen3.6-27B-DFlash-Q4_K_M.gguf
  • Avoid for this specific 3090/mmproj-loaded setup unless you have a reason: Qwen3.6-27B-DFlash-Q8_0.gguf

Q8_0 had competitive short-code acceptance, but its extra VRAM cost did not translate into better end-to-end throughput in our llama.cpp server path.

F16 and Q6_K are included for completeness. The benchmark tables below focus on the draft quants we actually tested on the RTX 3090 server path; Q6_K has not yet been benchmarked in that setup.

Files

| File | Size | SHA256 prefix |
|---|---|---|
| Qwen3.6-27B-DFlash-F16.gguf | 3310.7 MiB | f4edf54b8f83... |
| Qwen3.6-27B-DFlash-IQ4_NL.gguf | 941.9 MiB | 0a8d892acea4... |
| Qwen3.6-27B-DFlash-IQ4_XS.gguf | 891.1 MiB | 7807d36837ba... |
| Qwen3.6-27B-DFlash-Q4_K_M.gguf | 985.2 MiB | 254d1805b6b6... |
| Qwen3.6-27B-DFlash-Q5_K_M.gguf | 1169.0 MiB | 8c420ebe6d20... |
| Qwen3.6-27B-DFlash-Q6_K.gguf | 1364.2 MiB | cf3b3b5629a2... |
| Qwen3.6-27B-DFlash-Q8_0.gguf | 1763.8 MiB | 6799ace6a821... |

A full checksum manifest is included in SHA256SUMS.txt.
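On Linux, `sha256sum -c SHA256SUMS.txt` verifies everything at once. As a portable alternative, a minimal Python sketch (the helper name is my own) can check a downloaded file against the prefixes listed above:

```python
import hashlib

def sha256_prefix(path: str, n: int = 12) -> str:
    """Stream a file in 1 MiB chunks and return the first n hex digits of its SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:n]

# Example, using a filename and prefix from the table above:
# sha256_prefix("Qwen3.6-27B-DFlash-IQ4_XS.gguf") == "7807d36837ba"
```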

Benchmark context

These results are deployment-focused, not a universal model-quality benchmark.

Hardware and runtime:

  • GPU: single RTX 3090, 24 GiB VRAM
  • CPU/RAM: Kubernetes pod with 4 CPU cores and 32 GiB RAM for GPU-serving tests
  • Runtime: patched llama-server from spiritbuun/buun-llama-cpp
  • Target model: Qwen3.6 27B GGUF, primarily Q4_K_M
  • Multimodal projector: loaded during benchmark, quantized Q6_K
  • KV cache: turbo4 / turbo4
  • Context: -c 131072, -np 1
  • DFlash profile: -cd 256, --draft-max 8, --draft-min 1, --draft-p-min 0.75
  • Sampling: --temp 0.2

Important caveat: the loaded mmproj makes these numbers more conservative and more relevant to a real multimodal deployment than a text-only benchmark. If you run text-only, different CPU allocation, different target quant, different KV quant, or a different DFlash implementation, your speedups may differ.

Multimodal + DFlash compatibility note

These benchmarks were run with a small local compatibility patch applied on top of buun's fork so that a multimodal projector can stay loaded while DFlash speculative decoding remains active for text generation. Some llama.cpp/buun builds disable speculative decoding as soon as an mmproj is present; if your log contains:

speculative decoding is not supported by multimodal, it will be disabled

then DFlash is not actually active and you should not expect to reproduce the results here. Until this behavior is supported directly by the fork you are using, you need an equivalent server-side patch or you should run without --mmproj.
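Before trusting any benchmark numbers, it is worth scanning the server log for that message. A trivial helper (the function name is my own) makes the check explicit:

```python
# The disable message quoted above, as emitted by affected llama.cpp/buun builds.
DISABLE_MSG = "speculative decoding is not supported by multimodal, it will be disabled"

def dflash_active(log_text: str) -> bool:
    """Return False if the log shows speculative decoding was disabled."""
    return DISABLE_MSG not in log_text
```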

The benchmark prompts themselves were text-only; the mmproj was loaded to match the intended real deployment's VRAM/runtime shape. Image-containing turns may need separate validation depending on your harness and llama.cpp build.

Summary results

Baseline is Qwen3.6-27B Q4_K_M target, Q6_K mmproj loaded, turbo4/turbo4, no drafter.

(Chart: Qwen3.6 DFlash benchmark results)

| Configuration | Avg tok/s | Delta vs Q4 plain | Avg acceptance |
|---|---|---|---|
| Q4_K_M target, plain baseline | 36.16 | +0.0% | n/a |
| Q4_K_M target + local IQ4_XS draft | 40.79 | +12.8% | 32.31% |
| Q4_K_M target + local Q4_K_M draft | 40.07 | +10.8% | 32.44% |
| Q4_K_M target + published Q8_0 draft | 34.08 | -5.8% | n/a |
| IQ4_XS target + local IQ4_XS draft | 45.31 | +25.3% | 34.08% |
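The Delta column is plain relative speedup over the no-drafter baseline; for reference, it can be recomputed from the tok/s values:

```python
def delta_pct(tok_s: float, baseline: float = 36.16) -> float:
    """Percent throughput change vs the plain Q4_K_M baseline (36.16 tok/s)."""
    return round((tok_s / baseline - 1.0) * 100, 1)

print(delta_pct(40.79))  # 12.8  (Q4_K_M target + local IQ4_XS draft)
print(delta_pct(45.31))  # 25.3  (IQ4_XS target + local IQ4_XS draft)
```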

Per-case results

Delta is relative to the matching Q4_K_M plain baseline for that case.

| Case | Configuration | Tok/s | Delta vs plain | Acceptance |
|---|---|---|---|---|
| Short code regex debug | Q4_K_M target + local IQ4_XS draft | 48.57 | +31.9% | 40.18% |
| Short code regex debug | Q4_K_M target + local Q4_K_M draft | 45.48 | +23.6% | 37.68% |
| Short code regex debug | Q4_K_M target + published Q8_0 draft | 38.45 | +4.5% | n/a |
| Multi-turn install + credentials | Q4_K_M target + local IQ4_XS draft | 37.97 | +5.1% | 28.93% |
| Multi-turn install + credentials | Q4_K_M target + local Q4_K_M draft | 36.74 | +1.7% | 28.29% |
| Multi-turn install + credentials | Q4_K_M target + published Q8_0 draft | 35.80 | -0.9% | n/a |
| Long mixed shipping-code task | Q4_K_M target + local IQ4_XS draft | 38.95 | +7.0% | 29.36% |
| Long mixed shipping-code task | Q4_K_M target + local Q4_K_M draft | 42.71 | +17.3% | 34.46% |
| Long mixed shipping-code task | Q4_K_M target + published Q8_0 draft | 34.48 | -5.3% | n/a |
| Tool-heavy follow-up synthesis | Q4_K_M target + local IQ4_XS draft | 37.67 | +6.7% | 30.77% |
| Tool-heavy follow-up synthesis | Q4_K_M target + local Q4_K_M draft | 35.36 | +0.2% | 29.33% |
| Tool-heavy follow-up synthesis | Q4_K_M target + published Q8_0 draft | 27.58 | -21.8% | n/a |

The benchmark CSVs and the chart used for this model card currently live in the local experiment workspace; they can be uploaded alongside the GGUFs as reproducibility artifacts on request.

Example llama-server command

llama-server \
  -m /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-Qwen3.6-27B-Q6_K.gguf \
  -md /path/to/Qwen3.6-27B-DFlash-IQ4_XS.gguf \
  --spec-type dflash \
  -c 131072 -np 1 -ngl 999 \
  -b 256 -ub 64 \
  -cd 256 -ngld all -ctkd f16 -ctvd f16 \
  --draft-max 8 --draft-min 1 --draft-p-min 0.75 \
  -fa on --cache-type-k turbo4 --cache-type-v turbo4

The exact DFlash flag surface is fork/build-dependent. These GGUFs were produced and tested with buun's llama.cpp fork.
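Once the server is up, generation can be exercised through the OpenAI-compatible endpoint that llama-server exposes. A minimal sketch, assuming the default port 8080 and matching the benchmark sampling temperature:

```python
import json
import urllib.request

def chat_payload(prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat request (temperature matches --temp 0.2 above)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to llama-server's OpenAI-compatible chat endpoint and return the reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```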

Conversion notes

The GGUFs were created from the official z-lab safetensors source using buun's convert_hf_to_gguf.py and llama-quantize. Tokenizer/config files from the target Qwen3.6 model repo were copied into the conversion source before GGUF export, matching the conversion path used in our tests.
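A sketch of that conversion path, not runnable as-is: the paths are placeholders, and flag surfaces can vary between llama.cpp forks, so check the help output of the scripts shipped with the fork you use.

```
# 1. Copy tokenizer/config files from the target Qwen3.6 repo into the drafter source dir.
# 2. Convert the safetensors source to an F16 GGUF.
python convert_hf_to_gguf.py /path/to/Qwen3.6-27B-DFlash \
  --outfile Qwen3.6-27B-DFlash-F16.gguf --outtype f16
# 3. Quantize down to the smaller variants.
./llama-quantize Qwen3.6-27B-DFlash-F16.gguf Qwen3.6-27B-DFlash-IQ4_XS.gguf IQ4_XS
```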

References and credit

If you use these files, please cite the DFlash authors and follow the licenses/terms of the source drafter and target model.

DFlash citation

@article{chen2026dflash,
  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
Model details

  • Format: GGUF
  • Size: 2B params
  • Architecture: dflash-draft
