Qwen3.6-27B-OTQ-GGUF
OpenTQ TurboQuant dynamic-compatible GGUFs for Qwen/Qwen3.6-27B.
This is the stock llama.cpp release track. OpenTQ chooses the tensor-level allocation policy, but the files themselves use standard GGUF tensor types (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16). No custom OpenTQ runtime is required for these GGUF files.
The Hugging Face `pipeline_tag` follows the official Qwen3.6-27B card (image-text-to-text). These GGUF artifacts are validated here for local text inference with stock llama.cpp; vision tensors are not part of this text-focused release track.
Why This Release Exists
These builds target MacBook-class Apple Silicon where wall-clock time matters, especially with long prompts, large system messages and agent/tool context. The goal is not to publish another uniform quant; it is to provide a stock-compatible GGUF family where OpenTQ spends precision on the tensors that matter more for local inference.
What Is OpenTQ?
OpenTQ is an open quantization toolchain for TurboQuant-style low-bit model releases. For this GGUF track, OpenTQ does not introduce a custom file format: it audits the model tensor map, assigns standard GGUF tensor types per tensor family, validates the resulting files in stock llama.cpp, and publishes the allocation/evaluation evidence next to the model.
| Field | Value |
|---|---|
| Release track | Qwen3.6-27B-OTQ-GGUF |
| Method | OpenTQ / TurboQuant-inspired dynamic tensor allocation |
| Runtime | stock llama.cpp with Metal and FlashAttention |
| Compatibility boundary | standard GGUF only; no native OpenTQ kernel required |
| Current public variants | Q3_K_M compact, Q4_K_M balanced, and Q5_K_M quality-first |
| Validation machine | M1 Max, 8K prefill gate, bounded generation, deterministic release suites |
Paired BF16-vs-GGUF Quality Signal
These are small paired release signals, not full benchmark replacements. They use the same pinned task IDs, the same qwen3-no-think prompt format, deterministic decoding, and the same local scoring rules for both BF16 and the GGUF artifacts.
BF16 sidecar: Hugging Face Jobs H200 run 69f235d2d2c8bd8662bd320e, model Qwen/Qwen3.6-27B. Reproducibility data is published in zlaabsi/Qwen3.6-27B-OTQ-GGUF-benchmarks.
| Benchmark | BF16 | Q3_K_M | Delta Q3 | Q4_K_M | Delta Q4 | Q5_K_M | Delta Q5 |
|---|---|---|---|---|---|---|---|
| mmlu | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| mmlu_pro | 13/24 (54.2%) | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% |
| arc | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| hellaswag | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 14/16 (87.5%) | -6.2% | 15/16 (93.8%) | +0.0% |
| gsm8k | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 6/16 (37.5%) | +0.0% | 6/16 (37.5%) | +0.0% |
| math | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 7/16 (43.8%) | +6.2% | 6/16 (37.5%) | +0.0% |
| bbh | 18/24 (75.0%) | 18/24 (75.0%) | +0.0% | 18/24 (75.0%) | +0.0% | 19/24 (79.2%) | +4.2% |
| gpqa | 0/24 (0.0%) | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% |
| truthfulqa | 14/16 (87.5%) | 13/16 (81.2%) | -6.2% | 13/16 (81.2%) | -6.2% | 14/16 (87.5%) | +0.0% |
| winogrande | 14/16 (87.5%) | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% |
| drop | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% | 11/16 (68.8%) | -12.5% |
| piqa | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| commonsenseqa | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% |
| TOTAL | 157/232 (67.7%) | 154/232 (66.4%) | -1.3% | 155/232 (66.8%) | -0.9% | 155/232 (66.8%) | -0.9% |
Aggregate deltas on this practical subset are small: Q3 is -1.3 points, Q4 is -0.9 points, and Q5 is -0.9 points vs BF16. Per-benchmark rows still have small-N variance and should not be used as leaderboard claims.
Official Qwen3.6-27B full-harness scores remain the baseline for model capability claims. This table measures same-subset quantization regression only.
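For anyone reproducing the Delta columns: each entry is the percentage-point difference between a quant's subset accuracy and the BF16 accuracy on the same pinned tasks. A minimal sketch, using the TOTAL-row counts for Q3_K_M vs BF16:

```shell
# Percentage-point delta between a quant and BF16 on the same pinned subset.
# Counts are the TOTAL-row values for BF16 (157/232) and Q3_K_M (154/232).
bf16_correct=157
quant_correct=154
total=232
awk -v b="$bf16_correct" -v q="$quant_correct" -v n="$total" \
  'BEGIN { printf "%+.1f points\n", (q - b) * 100 / n }'
# -> -1.3 points
```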
Allocation Transparency
| Variant | Mapped tensors | F16 | Q3_K | Q4_K | Q5_K | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|---|
| Q3_K_M | 851 | 353 | 180 | 252 | 65 | 1 | 0 |
| Q4_K_M | 851 | 353 | 0 | 180 | 237 | 80 | 1 |
| Q5_K_M | 851 | 353 | 0 | 0 | 180 | 237 | 81 |
The allocation plots show where OpenTQ spends precision. For example, the compact profile pushes bulk MLP tensors lower while preserving attention anchors and output-sensitive tensors at higher precision.
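You can tally an allocation yourself from the per-variant evidence files with standard tools. A hedged sketch, assuming the published `tensor-types.txt` holds one `tensor_name=TYPE` entry per line and an `evidence/q4_k_m/` directory name (adjust the separator and path to match the actual files):

```shell
# Count how many tensors each GGUF type receives in a variant's plan.
# Assumes one "tensor_name=TYPE" line per tensor; tweak -F if the file
# uses whitespace or another separator instead of "=".
awk -F'=' '{ count[$NF]++ } END { for (t in count) print count[t], t }' \
  evidence/q4_k_m/tensor-types.txt | sort -rn
```

The per-type counts should match the Allocation Transparency table row for that variant.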
Custom Allocation Policies
OpenTQ can also generate a dynamic GGUF plan from a user-defined YAML/JSON policy. This lets you customize where precision is spent without editing OpenTQ source code.
```yaml
name: MY-DYN-Q4
base_ftype: Q4_K_M
target: custom 32GB Apple Silicon profile
requires_imatrix: false
category_types:
  embeddings: Q6_K
  lm_head: Q8_0
  self_attn_proj: Q6_K
  linear_attn_proj: Q5_K
  linear_attn_conv: F16
  mlp_proj: Q3_K
edge_layers: 2
edge_overrides:
  mlp_proj: Q5_K
  self_attn_proj: Q8_0
periodic_stride: 4
periodic_overrides:
  self_attn_proj: Q6_K
```
Generate the plan:
```shell
git clone https://github.com/zlaabsi/opentq
cd opentq
uv sync
uv run opentq dynamic-gguf-plan \
  --policy-file policies/qwen36-custom-dyn-q4.yaml \
  --output artifacts/qwen36-my-dyn-q4 \
  --llama-cpp /path/to/llama.cpp \
  --source-gguf artifacts/qwen36-bf16/Qwen3.6-27B-BF16.gguf \
  --target-gguf artifacts/qwen36-my-dyn-q4/Qwen3.6-27B-MY-DYN-Q4.gguf
```
The output directory contains plan.json, tensor-types.txt, tensor-types.annotated.tsv, and a runnable quantize.sh. The stock-compatible track supports custom allocation across standard GGUF tensor types; arbitrary new quantization kernels belong to the native OpenTQ runtime track.
See the full OpenTQ cookbook for built-in profiles, external policies, release evidence, and validation workflows.
Quantization Monitor
OpenTQ includes a terminal dashboard for long quantization batches:
```shell
uv run opentq monitor \
  --root artifacts/qwen3.6-27b \
  --watch \
  --interval 5
```
Use `uv run opentq status --root artifacts/qwen3.6-27b` when you need machine-readable status output for automation.
Files
| File | Quant | Size | SHA256 | Target |
|---|---|---|---|---|
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192 | 32 GB Apple Silicon first pick |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 6b1b9bcbb987e8861c9727488b320e90446d1610a6d3341e3c2185e7388bc2e9 | 32 GB moderate context; 48 GB+ preferred |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | aaf270a91d943e9f26692f267aa9ccaa5359ae2084abb8ba76d84d56b660ab16 | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom |
Variant Family
| File | Quant | Size | Apple Silicon target | Role |
|---|---|---|---|---|
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 32 GB Apple Silicon first pick | smallest public OpenTQ dynamic-compatible release |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 32 GB moderate context; 48 GB+ preferred | quality-balanced public release |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom | quality-first public release for larger unified-memory Macs |
Naming
- OTQ: OpenTQ, the release/tooling brand.
- TurboQuant: the quantization family and design direction.
- DYN: dynamic tensor-level allocation; different tensor families receive different GGUF quant types.
- Q3_K_M / Q4_K_M / Q5_K_M: standard GGUF quant names recognized by Hugging Face and stock llama.cpp.
Which File Should I Use?
- Q3_K_M: first pick for 32 GB Apple Silicon and larger app/tool contexts.
- Q4_K_M: quality-balanced pick; usable on 32 GB at moderate context, more comfortable on 48 GB+.
- Q5_K_M: quality-first pick; measured on M1 Max 32 GB, but 48 GB+ is the practical target.
Hardware Compatibility
| Hardware | Status | Recommended artifact | Notes |
|---|---|---|---|
| M1 Max 32 GB | Measured | Q3_K_M; Q4_K_M; Q5_K_M tight | Q5_K_M passed 8K gates but leaves limited app headroom. |
| 32 GB Apple Silicon | Expected | Q3_K_M; Q4_K_M only with care | Capacity guidance for M-series systems with similar usable unified memory. |
| 48 GB Apple Silicon | Expected | Q4_K_M; Q5_K_M | Recommended floor for comfortable Q5 use. |
| 64 GB+ Apple Silicon | Expected | Q5_K_M quality-first | Best local target for Q5 plus larger contexts and other apps. |
| 16 GB Apple Silicon | Not recommended | None | Current 27B artifacts leave too little memory headroom. |
Expected rows are capacity guidance, not measured benchmark claims.
Q5_K_M is measured on M1 Max 32 GB, but 48 GB+ is the practical recommendation for comfortable use.
Model Overview
| Base model field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameter class | 27B dense model |
| HF architecture | Qwen3_5ForConditionalGeneration |
| Layer count | 64 language layers |
| Hidden size | 5120 |
| Native context | 262,144 tokens in the base model; practical local context depends on RAM, KV-cache settings, and other running apps |
| Public GGUF modality | text inference release track |
| Runtime target | Apple Silicon Metal through stock llama.cpp |
Runtime Compatibility
- llama.cpp, llama-cli, llama-server: supported.
- LM Studio and Ollama local GGUF import: expected to work as standard GGUF loaders.
- OpenTQ custom runtime: not required for this repo.
- Native TurboQuant/OpenTQ tensor formats: separate release track, not mixed into this GGUF repo.
- MLX: not the target runtime for this GGUF track.
Quick Start
1. Download a GGUF
```shell
hf download zlaabsi/Qwen3.6-27B-OTQ-GGUF Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf --local-dir models/Qwen3.6-27B-OTQ-GGUF
```
Use Q3_K_M first on 32 GB Macs. Use Q4_K_M when you can afford the extra memory. Use Q5_K_M for quality-first local inference when headroom matters less than fidelity.
2. Build llama.cpp With Metal
```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON
cmake --build build -j
```
3. Run Locally
```shell
./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --temp 0.6 \
  --top-p 0.95 \
  -p "<|im_start|>user\nExplain the tradeoff between prefill and decode throughput.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
```
4. Serve An OpenAI-Compatible API
```shell
./build/bin/llama-server \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b-otq","messages":[{"role":"user","content":"Give me a 3-bullet summary of OpenTQ."}],"temperature":0.6}'
```
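When scripting against the server, the standard OpenAI chat-completion response shape can be unwrapped with `jq`. A hedged sketch, assuming the server from the previous step is listening on port 8080:

```shell
# Pull just the assistant text out of the chat completion response.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b-otq","messages":[{"role":"user","content":"ping"}]}' \
  | jq -r '.choices[0].message.content'
```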
llama.cpp Settings
| Setting | Recommended value | Why |
|---|---|---|
| GPU layers | -ngl 99 | Offload all supported layers to Metal on Apple Silicon |
| FlashAttention | -fa / -fa on | Critical for long-context prefill wall-clock |
| Context | -c 8192 first | Validated release gate; increase only after checking memory headroom |
| Prompt format | Qwen chat template | Keep the `<\|im_start\|>` / `<\|im_end\|>` markers intact |
| Sampling | --temp 0.6 --top-p 0.95 | Good default for general chat; tighten for deterministic evals |
| Server | llama-server | Use for OpenAI-compatible local apps and agents |
Apple Silicon Guide
| Machine class | Recommendation |
|---|---|
| 32 GB MacBook Pro / Mac Studio | Prefer Q3_K_M for headroom, especially with agentic prompts and other apps open. |
| 48-64 GB Apple Silicon | Prefer Q4_K_M for balance; use Q5_K_M for quality-first local inference. |
| 96 GB+ Apple Silicon | Prefer Q5_K_M; larger native/custom candidates remain separate until runtime gates pass. |
| Agent workloads with large tool context | Measure total wall-clock time. Decode-only tok/s hides prefill cost. |
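One way to follow the last row's advice is to time a full request end to end rather than reading decode-only tok/s. A sketch, assuming the Quick Start layout and a hypothetical `prompts/agent-context.txt` file holding a long agent/tool prompt:

```shell
# Wall-clock a complete run: prefill of the long prompt plus 256 decode tokens.
# `time` reports total elapsed seconds, which is what agent latency feels like.
time ./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 -fa -c 8192 \
  -f prompts/agent-context.txt \
  -n 256
```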
Benchmarks
| Variant | Test | Throughput (t/s) | Backend | Size |
|---|---|---|---|---|
| Q3_K_M | pp8192 | 107.09 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q3_K_M | tg128 | 10.19 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q4_K_M | pp8192 | 106.98 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q4_K_M | tg128 | 9.62 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q5_K_M | pp8192 | 93.94 +/- 0.00 | MTL,BLAS | 19.91 GiB |
| Q5_K_M | tg128 | 8.87 +/- 0.00 | MTL,BLAS | 19.91 GiB |
The plots compare the quantized OTQ artifacts against each other on measured release data. Official Qwen scores are kept as a reference table, not plotted as a fake delta.
Practical Mini-Subset Quality Signals
See Paired BF16-vs-GGUF Quality Signal. The table and chart are placed near the top of this card because they are the main same-subset quantization-regression evidence.
Release Evaluation
| Variant | Suite | Passed | Pass rate | Mean latency | p95 latency |
|---|---|---|---|---|---|
| Q3_K_M | smoke | 5/5 | 1.0 | 7.605s | 22.371s |
| Q3_K_M | release | 10/10 | 1.0 | 9.325s | 26.905s |
| Q4_K_M | smoke | 5/5 | 1.0 | 8.333s | 23.826s |
| Q4_K_M | release | 10/10 | 1.0 | 9.907s | 21.395s |
| Q5_K_M | smoke | 5/5 | 1.0 | 16.046s | 34.387s |
| Q5_K_M | release | 10/10 | 1.0 | 16.955s | 34.580s |
Release Gate
| Variant | Metadata | Bounded generation | 8K llama-bench | Smoke gate | Release gate | Timestamp |
|---|---|---|---|---|---|---|
| Q3_K_M | passed | passed (24.246s) | passed (91.371s) | 5/5 | 10/10 | 2026-04-27T19:38:50.320253+00:00 |
| Q4_K_M | passed | passed (22.348s) | passed (93.163s) | 5/5 | 10/10 | 2026-04-27T19:43:25.174228+00:00 |
| Q5_K_M | passed | passed (44.272s) | passed (119.964s) | 5/5 | 10/10 | 2026-04-28T23:18:17.700281+00:00 |
Official Baseline vs OTQ Claims
| Item | Status |
|---|---|
| Official Qwen3.6-27B source scores | Imported from the official model card into benchmarks/official_qwen36_baseline.csv |
| OTQ Q3_K_M / Q4_K_M / Q5_K_M runtime | Measured with llama-bench on M1 Max |
| OTQ functional release gates | Measured with deterministic smoke and extended suites |
| Official benchmark deltas | Not claimed yet; requires running the same tasks/scoring on the GGUF artifacts |
Transparency Files
Each variant has full release evidence under evidence/<quant>/:
- validation.json
- quality-eval.json
- release-eval.json
- opentq-plan.json
- tensor-types.txt
- tensor-types.annotated.tsv
- quantize-dry-run.log
Reproduce Release Evidence
```shell
git clone https://github.com/zlaabsi/opentq
cd opentq
uv sync
uv run python scripts/stage_qwen36_otq_gguf_repo.py
uv run python scripts/build_qwen36_release_report.py --repo artifacts/hf-gguf-canonical/Qwen3.6-27B-OTQ-GGUF
```
Run the same style of OTQ release evaluation:
```shell
LLAMA_CPP_DIR=/path/to/llama.cpp ./scripts/run_qwen36_otq_eval.sh
```
Run the long-context benchmark directly:
```shell
./build/bin/llama-bench \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa on \
  -p 8192 \
  -n 128 \
  -r 1 \
  --no-warmup
```