Qwen3.6-27B-DFlash GGUF Quantizations
This repository stages GGUF quantizations of the DFlash drafter for Qwen/Qwen3.6-27B.
These files are intended for DFlash speculative decoding with llama.cpp builds that support the dflash-draft architecture, especially spiritbuun/buun-llama-cpp.
This is not an official z-lab or Qwen release. The source drafter is z-lab/Qwen3.6-27B-DFlash, which is licensed MIT and currently notes that Qwen3.6 DFlash support is still maturing because of architecture changes including causal SWA layers. The target model is Qwen/Qwen3.6-27B, licensed Apache 2.0.
Recommended files
Based on our RTX 3090 llama.cpp/buun tests with a multimodal projector loaded:
- Best all-around Q4_K_M target setup: Qwen3.6-27B-DFlash-IQ4_XS.gguf
- Strong long-code challenger: Qwen3.6-27B-DFlash-Q4_K_M.gguf
- Avoid for this specific 3090/mmproj-loaded setup unless you have a reason: Qwen3.6-27B-DFlash-Q8_0.gguf
Q8_0 had competitive short-code acceptance, but its extra VRAM cost did not translate into better end-to-end throughput in our llama.cpp server path.
F16 and Q6_K are included for completeness. The benchmark tables below focus on the draft quants we actually tested on the RTX 3090 server path; Q6_K has not yet been benchmarked in that setup.
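If you only want the recommended drafter, a minimal download sketch (assuming the `huggingface-cli` tool and this repo id; adjust the filename and local directory to taste):

```bash
# Sketch: pull only the recommended IQ4_XS drafter into ./models (filenames are listed in the table below)
huggingface-cli download Ardenzard/Qwen3.6-27B-DFlash-GGUF \
  Qwen3.6-27B-DFlash-IQ4_XS.gguf --local-dir ./models
```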
Files
| File | Size | SHA256 prefix |
|---|---|---|
| Qwen3.6-27B-DFlash-F16.gguf | 3310.7 MiB | f4edf54b8f83... |
| Qwen3.6-27B-DFlash-IQ4_NL.gguf | 941.9 MiB | 0a8d892acea4... |
| Qwen3.6-27B-DFlash-IQ4_XS.gguf | 891.1 MiB | 7807d36837ba... |
| Qwen3.6-27B-DFlash-Q4_K_M.gguf | 985.2 MiB | 254d1805b6b6... |
| Qwen3.6-27B-DFlash-Q5_K_M.gguf | 1169.0 MiB | 8c420ebe6d20... |
| Qwen3.6-27B-DFlash-Q6_K.gguf | 1364.2 MiB | cf3b3b5629a2... |
| Qwen3.6-27B-DFlash-Q8_0.gguf | 1763.8 MiB | 6799ace6a821... |
A full checksum manifest is included in SHA256SUMS.txt.
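To verify downloads before use, a minimal check (assuming the manifest uses the standard `sha256sum` format and sits next to the GGUFs):

```bash
# Verify every downloaded GGUF against the bundled manifest
sha256sum -c SHA256SUMS.txt

# Or hash a single file and compare its prefix against the table above
sha256sum Qwen3.6-27B-DFlash-IQ4_XS.gguf
```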
Benchmark context
These results are deployment-focused, not a universal model-quality benchmark.
Hardware and runtime:
- GPU: single RTX 3090, 24 GiB VRAM
- CPU/RAM: Kubernetes pod with 4 CPU cores and 32 GiB RAM for GPU-serving tests
- Runtime: patched llama-server from spiritbuun/buun-llama-cpp
- Target model: Qwen3.6 27B GGUF, primarily Q4_K_M
- Multimodal projector: loaded during benchmark, quantized Q6_K
- KV cache: turbo4 / turbo4
- Context: -c 131072, -np 1
- DFlash profile: -cd 256, --draft-max 8, --draft-min 1, --draft-p-min 0.75
- Sampling: --temp 0.2
Important caveat: the loaded mmproj makes these numbers more conservative, and more representative of a real multimodal deployment, than a text-only benchmark would be. If you run text-only, or with a different CPU allocation, target quant, KV quant, or DFlash implementation, your speedups may differ.
Multimodal + DFlash compatibility note
These benchmarks were run with a small local compatibility patch applied on top of buun's fork so that a multimodal projector can stay loaded while DFlash speculative decoding remains active for text generation. Some llama.cpp/buun builds disable speculative decoding as soon as an mmproj is present; if your log contains:
`speculative decoding is not supported by multimodal, it will be disabled`
then DFlash is not actually active and you should not expect to reproduce the results here. Until this behavior is supported directly by the fork you are using, you need an equivalent server-side patch or you should run without --mmproj.
The benchmark prompts themselves were text-only; the mmproj was loaded to match the intended real deployment's VRAM/runtime shape. Image-containing turns may need separate validation depending on your harness and llama.cpp build.
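As a quick way to confirm DFlash stayed active in your own runs, you can grep the server output for that disable message; a minimal sketch, assuming you capture llama-server output to a hypothetical `server.log`:

```bash
# server.log is a placeholder for wherever you capture llama-server output
if grep -q "speculative decoding is not supported by multimodal" server.log; then
  echo "DFlash was disabled by the mmproj path; these benchmark numbers will not reproduce"
else
  echo "No disable message found; DFlash speculative decoding should be active"
fi
```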
Summary results
Baseline is Qwen3.6-27B Q4_K_M target, Q6_K mmproj loaded, turbo4/turbo4, no drafter.
| Configuration | Avg tok/s | Delta vs Q4 plain | Avg acceptance |
|---|---|---|---|
| Q4_K_M target plain baseline | 36.16 | +0.0% | n/a |
| Q4_K_M target + local IQ4_XS draft | 40.79 | +12.8% | 32.31% |
| Q4_K_M target + local Q4_K_M draft | 40.07 | +10.8% | 32.44% |
| Q4_K_M target + published Q8_0 draft | 34.08 | -5.8% | n/a |
| IQ4_XS target + local IQ4_XS draft | 45.31 | +25.3% | 34.08% |
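The delta column is simply the relative tok/s gain over the plain Q4_K_M baseline; for example, the IQ4_XS-draft summary row works out as:

```bash
# (draft tok/s / baseline tok/s - 1) * 100
awk 'BEGIN { printf "%.1f%%\n", (40.79 / 36.16 - 1) * 100 }'   # -> 12.8%
```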
Per-case results
Delta is relative to the matching Q4_K_M plain baseline for that case.
| Case | Configuration | Tok/s | Delta vs plain | Acceptance |
|---|---|---|---|---|
| Short code regex debug | Q4_K_M target + local IQ4_XS draft | 48.57 | +31.9% | 40.18% |
| Short code regex debug | Q4_K_M target + local Q4_K_M draft | 45.48 | +23.6% | 37.68% |
| Short code regex debug | Q4_K_M target + published Q8_0 draft | 38.45 | +4.5% | n/a |
| Multi-turn install + credentials | Q4_K_M target + local IQ4_XS draft | 37.97 | +5.1% | 28.93% |
| Multi-turn install + credentials | Q4_K_M target + local Q4_K_M draft | 36.74 | +1.7% | 28.29% |
| Multi-turn install + credentials | Q4_K_M target + published Q8_0 draft | 35.80 | -0.9% | n/a |
| Long mixed shipping-code task | Q4_K_M target + local IQ4_XS draft | 38.95 | +7.0% | 29.36% |
| Long mixed shipping-code task | Q4_K_M target + local Q4_K_M draft | 42.71 | +17.3% | 34.46% |
| Long mixed shipping-code task | Q4_K_M target + published Q8_0 draft | 34.48 | -5.3% | n/a |
| Tool-heavy follow-up synthesis | Q4_K_M target + local IQ4_XS draft | 37.67 | +6.7% | 30.77% |
| Tool-heavy follow-up synthesis | Q4_K_M target + local Q4_K_M draft | 35.36 | +0.2% | 29.33% |
| Tool-heavy follow-up synthesis | Q4_K_M target + published Q8_0 draft | 27.58 | -21.8% | n/a |
The full plotted benchmark CSVs and chart used for this model card are available in the local experiment workspace. If desired, they can be uploaded alongside the GGUFs as reproducibility artifacts.
Example llama-server command
```bash
llama-server \
  -m /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-Qwen3.6-27B-Q6_K.gguf \
  -md /path/to/Qwen3.6-27B-DFlash-IQ4_XS.gguf \
  --spec-type dflash \
  -c 131072 -np 1 -ngl 999 \
  -b 256 -ub 64 \
  -cd 256 -ngld all -ctkd f16 -ctvd f16 \
  --draft-max 8 --draft-min 1 --draft-p-min 0.75 \
  -fa on --cache-type-k turbo4 --cache-type-v turbo4
```
The exact DFlash flag surface is fork/build-dependent. These GGUFs were produced and tested with buun's llama.cpp fork.
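Once the server is up, requests go through the usual llama-server HTTP API; a minimal text-only smoke test, assuming the default 127.0.0.1:8080 bind (adjust if you pass --host/--port):

```bash
# OpenAI-compatible chat endpoint; the drafter is used transparently on the server side
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a regex that matches ISO 8601 dates."}],
        "temperature": 0.2
      }'
```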
Conversion notes
The GGUFs were created from the official z-lab safetensors source using buun's convert_hf_to_gguf.py and llama-quantize. Tokenizer/config files from the target Qwen3.6 model repo were copied into the conversion source before GGUF export, matching the conversion path used in our tests.
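For reference, the overall shape of that path looks roughly like the sketch below; the paths, output names, and chosen quant are illustrative, and both tools must come from buun's fork rather than upstream llama.cpp:

```bash
# Illustrative sketch only; paths, file names, and the quant type are assumptions
cp /path/to/Qwen3.6-27B/tokenizer* /path/to/Qwen3.6-27B-DFlash/   # copy tokenizer/config from the target repo
python convert_hf_to_gguf.py /path/to/Qwen3.6-27B-DFlash \
  --outtype f16 --outfile Qwen3.6-27B-DFlash-F16.gguf
llama-quantize Qwen3.6-27B-DFlash-F16.gguf Qwen3.6-27B-DFlash-IQ4_XS.gguf IQ4_XS
```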
References and credit
- Source DFlash drafter: z-lab/Qwen3.6-27B-DFlash
- Target model: Qwen/Qwen3.6-27B
- DFlash paper: DFlash: Block Diffusion for Flash Speculative Decoding
- llama.cpp fork used for conversion/testing: spiritbuun/buun-llama-cpp
- Upstream llama.cpp project: ggml-org/llama.cpp
If you use these files, please cite the DFlash authors and follow the licenses/terms of the source drafter and target model.
DFlash citation
```bibtex
@article{chen2026dflash,
  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
```