Qwen3.6-27B-OTQ-GGUF


OpenTQ TurboQuant dynamic-compatible GGUFs for Qwen/Qwen3.6-27B.

This is the stock llama.cpp release track. OpenTQ chooses the tensor-level allocation policy, but the files themselves use only standard GGUF tensor types (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16). No custom OpenTQ runtime is required for these GGUF files.
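
If you want to verify this locally, you can list the tensor types with generic GGUF tooling. A minimal sketch, assuming the gguf Python package and its gguf-dump utility (neither ships with this release):

# Print GGUF metadata and the per-tensor type map (first 40 lines shown)
pip install gguf
gguf-dump models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | head -n 40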

The Hugging Face pipeline_tag follows the official Qwen3.6-27B card (image-text-to-text). These GGUF artifacts are validated here for local text inference with stock llama.cpp; vision tensors are not part of this text-focused release track.

Why This Release Exists

These builds target MacBook-class Apple Silicon where wall-clock time matters, especially with long prompts, large system messages, and agent/tool context. The goal is not to publish another uniform quant; it is to provide a stock-compatible GGUF family where OpenTQ spends precision on the tensors that matter most for local inference.

What Is OpenTQ?

OpenTQ is an open quantization toolchain for TurboQuant-style low-bit model releases. For this GGUF track, OpenTQ does not introduce a custom file format: it audits the model tensor map, assigns standard GGUF tensor types per tensor family, validates the resulting files in stock llama.cpp, and publishes the allocation/evaluation evidence next to the model.

| Field | Value |
| --- | --- |
| Release track | Qwen3.6-27B-OTQ-GGUF |
| Method | OpenTQ / TurboQuant-inspired dynamic tensor allocation |
| Runtime | Stock llama.cpp with Metal and FlashAttention |
| Compatibility boundary | Standard GGUF only; no native OpenTQ kernel required |
| Current public variants | Q3_K_M (compact), Q4_K_M (balanced), Q5_K_M (quality-first) |
| Validation machine | M1 Max; 8K prefill gate, bounded generation, deterministic release suites |

Paired BF16-vs-GGUF Quality Signal

These are small paired release signals, not full benchmark replacements. They use the same pinned task IDs, the qwen3-no-think prompt format, deterministic decoding, and the same local scoring rules for both BF16 and the GGUF artifacts.

BF16 sidecar: Hugging Face Jobs H200 run 69f235d2d2c8bd8662bd320e, model Qwen/Qwen3.6-27B. Reproducibility data is published in zlaabsi/Qwen3.6-27B-OTQ-GGUF-benchmarks.
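
For context, deterministic decoding here means greedy, seeded generation. A minimal sketch with stock llama-cli (the pinned task prompts and scoring rules live in the benchmarks repo, not in this snippet):

# Greedy decoding: --temp 0 removes sampling noise; --seed pins any remaining RNG
./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 -fa -c 8192 \
  --temp 0 --seed 42 \
  -n 64 \
  -p "2+2="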

Paired BF16 vs GGUF quantization deltas

| Benchmark | BF16 | Q3_K_M | Delta Q3 | Q4_K_M | Delta Q4 | Q5_K_M | Delta Q5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mmlu | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| mmlu_pro | 13/24 (54.2%) | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% |
| arc | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| hellaswag | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 14/16 (87.5%) | -6.2% | 15/16 (93.8%) | +0.0% |
| gsm8k | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 6/16 (37.5%) | +0.0% | 6/16 (37.5%) | +0.0% |
| math | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 7/16 (43.8%) | +6.2% | 6/16 (37.5%) | +0.0% |
| bbh | 18/24 (75.0%) | 18/24 (75.0%) | +0.0% | 18/24 (75.0%) | +0.0% | 19/24 (79.2%) | +4.2% |
| gpqa | 0/24 (0.0%) | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% |
| truthfulqa | 14/16 (87.5%) | 13/16 (81.2%) | -6.2% | 13/16 (81.2%) | -6.2% | 14/16 (87.5%) | +0.0% |
| winogrande | 14/16 (87.5%) | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% |
| drop | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% | 11/16 (68.8%) | -12.5% |
| piqa | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| commonsenseqa | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% |
| TOTAL | 157/232 (67.7%) | 154/232 (66.4%) | -1.3% | 155/232 (66.8%) | -0.9% | 155/232 (66.8%) | -0.9% |

Aggregate deltas on this practical subset are small: Q3 is -1.3 points, Q4 is -0.9 points, and Q5 is -0.9 points vs BF16. Per-benchmark rows still have small-N variance and should not be used as leaderboard claims.

Official Qwen3.6-27B full-harness scores remain the baseline for model capability claims. This table measures same-subset quantization regression only.

Allocation Transparency

| Variant | Mapped tensors | F16 | Q3_K | Q4_K | Q5_K | Q6_K | Q8_0 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Q3_K_M | 851 | 353 | 180 | 252 | 65 | 1 | 0 |
| Q4_K_M | 851 | 353 | 0 | 180 | 237 | 80 | 1 |
| Q5_K_M | 851 | 353 | 0 | 0 | 180 | 237 | 81 |

Tensor allocation

Allocation policy

The allocation plots show where OpenTQ spends precision. For example, the compact profile pushes bulk MLP tensors lower while preserving attention anchors and output-sensitive tensors at higher precision.
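
You can recount the table above from the published evidence. A sketch that assumes the per-tensor type is the last whitespace-separated field in tensor-types.txt (the q3_k_m directory name is illustrative; see Transparency Files below):

# Tally how many tensors landed on each GGUF type
awk '{print $NF}' evidence/q3_k_m/tensor-types.txt | sort | uniq -c | sort -rn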

Custom Allocation Policies

OpenTQ can also generate a dynamic GGUF plan from a user-defined YAML/JSON policy. This lets you customize where precision is spent without editing OpenTQ source code.

name: MY-DYN-Q4
base_ftype: Q4_K_M
target: custom 32GB Apple Silicon profile
requires_imatrix: false

category_types:
  embeddings: Q6_K
  lm_head: Q8_0
  self_attn_proj: Q6_K
  linear_attn_proj: Q5_K
  linear_attn_conv: F16
  mlp_proj: Q3_K

edge_layers: 2
edge_overrides:
  mlp_proj: Q5_K
  self_attn_proj: Q8_0

periodic_stride: 4
periodic_overrides:
  self_attn_proj: Q6_K

Generate the plan:

git clone https://github.com/zlaabsi/opentq
cd opentq
uv sync

uv run opentq dynamic-gguf-plan \
  --policy-file policies/qwen36-custom-dyn-q4.yaml \
  --output artifacts/qwen36-my-dyn-q4 \
  --llama-cpp /path/to/llama.cpp \
  --source-gguf artifacts/qwen36-bf16/Qwen3.6-27B-BF16.gguf \
  --target-gguf artifacts/qwen36-my-dyn-q4/Qwen3.6-27B-MY-DYN-Q4.gguf

The output directory contains plan.json, tensor-types.txt, tensor-types.annotated.tsv, and a runnable quantize.sh. The stock-compatible track supports custom allocation across standard GGUF tensor types; arbitrary new quantization kernels belong to the native OpenTQ runtime track.
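
Before running quantize.sh, it can be worth skimming the generated artifacts. A minimal sketch (plan.json's exact schema is defined by OpenTQ, so jq is used only as a pretty-printer here):

# Pretty-print the machine-readable plan for review
jq . artifacts/qwen36-my-dyn-q4/plan.json | less
# Skim the first per-tensor assignments with their annotations
head -n 20 artifacts/qwen36-my-dyn-q4/tensor-types.annotated.tsv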

See the full OpenTQ cookbook for built-in profiles, external policies, release evidence, and validation workflows.

Quantization Monitor

OpenTQ includes a terminal dashboard for long quantization batches:

uv run opentq monitor \
  --root artifacts/qwen3.6-27b \
  --watch \
  --interval 5

OpenTQ quantization monitor

Use uv run opentq status --root artifacts/qwen3.6-27b when you need machine-readable status output for automation.
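
For unattended batches, one pattern is to archive periodic status snapshots. A minimal sketch; it makes no assumption about the status output format and simply appends timestamped raw dumps:

# Append a UTC timestamp plus the raw status output every 60 seconds
while sleep 60; do
  date -u +%FT%TZ >> status-log.txt
  uv run opentq status --root artifacts/qwen3.6-27b >> status-log.txt
done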

Files

| File | Quant | Size | SHA256 | Target |
| --- | --- | --- | --- | --- |
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192 | 32 GB Apple Silicon first pick |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 6b1b9bcbb987e8861c9727488b320e90446d1610a6d3341e3c2185e7388bc2e9 | 32 GB moderate context; 48 GB+ preferred |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | aaf270a91d943e9f26692f267aa9ccaa5359ae2084abb8ba76d84d56b660ab16 | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom |
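
After downloading, you can check a file against the digests above with the shasum tool that ships with macOS, for example:

cd models/Qwen3.6-27B-OTQ-GGUF
# shasum expects "digest  filename" pairs on stdin when run with -c
echo "0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192  Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf" | shasum -a 256 -c -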

Variant Family

| File | Quant | Size | Apple Silicon target | Role |
| --- | --- | --- | --- | --- |
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 32 GB first pick | Smallest public OpenTQ dynamic-compatible release |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 32 GB moderate context; 48 GB+ preferred | Quality-balanced public release |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom | Quality-first public release for larger unified-memory Macs |

Naming

  • OTQ: OpenTQ, the release/tooling brand.
  • TurboQuant: the quantization family and design direction.
  • DYN: dynamic tensor-level allocation; different tensor families receive different GGUF quant types.
  • Q3_K_M / Q4_K_M / Q5_K_M: standard GGUF quant names recognized by Hugging Face and stock llama.cpp.

Which File Should I Use?

  • Q3_K_M: first pick for 32 GB Apple Silicon and larger app/tool contexts.
  • Q4_K_M: quality-balanced pick; usable on 32 GB at moderate context, more comfortable on 48 GB+.
  • Q5_K_M: quality-first pick; measured on M1 Max 32 GB, but 48 GB+ is the practical target.
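
If you are unsure which class your Mac falls into, check the installed unified memory first; macOS reports it in bytes:

sysctl -n hw.memsize
# 34359738368 bytes = 32 GB -> start with Q3_K_M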

Hardware Compatibility

| Hardware | Status | Recommended artifact | Notes |
| --- | --- | --- | --- |
| M1 Max 32 GB | Measured | Q3_K_M; Q4_K_M; Q5_K_M (tight) | Q5_K_M passed 8K gates but leaves limited headroom for other apps. |
| 32 GB Apple Silicon | Expected | Q3_K_M; Q4_K_M only with care | Capacity guidance for M-series systems with similar usable unified memory. |
| 48 GB Apple Silicon | Expected | Q4_K_M; Q5_K_M | Recommended floor for comfortable Q5 use. |
| 64 GB+ Apple Silicon | Expected | Q5_K_M (quality-first) | Best local target for Q5 plus larger contexts and other apps. |
| 16 GB Apple Silicon | Not recommended | None | Current 27B artifacts leave too little memory headroom. |

Expected rows are capacity guidance, not measured benchmark claims. Q5_K_M is measured on M1 Max 32 GB, but 48 GB+ is the practical recommendation for comfortable use.

Model Overview

| Base model field | Value |
| --- | --- |
| Base model | Qwen/Qwen3.6-27B |
| Parameter class | 27B dense model |
| HF architecture | Qwen3_5ForConditionalGeneration |
| Layer count | 64 language layers |
| Hidden size | 5120 |
| Native context | 262,144 tokens in the base model; practical local context depends on RAM, KV-cache settings, and other apps |
| Public GGUF modality | Text-inference release track |
| Runtime target | Apple Silicon Metal through stock llama.cpp |

Runtime Compatibility

  • llama.cpp, llama-cli, llama-server: supported.
  • LM Studio and Ollama local GGUF import: expected to work as standard GGUF loaders (see the sketch after this list).
  • OpenTQ custom runtime: not required for this repo.
  • Native TurboQuant/OpenTQ tensor formats: separate release track, not mixed into this GGUF repo.
  • MLX: not the target runtime for this GGUF track.
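
For the Ollama path flagged above, a minimal sketch of the standard local GGUF import flow (expected to work, but not validated as part of this release; the model name is arbitrary):

# Point a Modelfile at the downloaded GGUF, then register and run it
cat > Modelfile <<'EOF'
FROM ./models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf
EOF
ollama create qwen3.6-27b-otq -f Modelfile
ollama run qwen3.6-27b-otq "Say hello in five words."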

Quick Start

1. Download A GGUF

hf download zlaabsi/Qwen3.6-27B-OTQ-GGUF Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf --local-dir models/Qwen3.6-27B-OTQ-GGUF

Use Q3_K_M first on 32 GB Macs. Use Q4_K_M when you can afford the extra memory. Use Q5_K_M for quality-first local inference when headroom matters less than fidelity.

2. Build llama.cpp With Metal

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON
cmake --build build -j

3. Run Locally

./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --temp 0.6 \
  --top-p 0.95 \
  -p "<|im_start|>user\nExplain the tradeoff between prefill and decode throughput.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

4. Serve An OpenAI-Compatible API

./build/bin/llama-server \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b-otq","messages":[{"role":"user","content":"Give me a 3-bullet summary of OpenTQ."}],"temperature":0.6}'

llama.cpp Settings

| Setting | Recommended value | Why |
| --- | --- | --- |
| GPU layers | -ngl 99 | Offload all supported layers to Metal on Apple Silicon |
| FlashAttention | -fa / -fa on | Critical for long-context prefill wall-clock |
| Context | -c 8192 first | Validated release gate; increase only after checking memory headroom |
| Prompt format | Qwen chat template | Keep the im_start/im_end turn structure shown in Quick Start |
| Sampling | --temp 0.6 --top-p 0.95 | Good default for general chat; tighten for deterministic evals |
| Server | llama-server | Use for OpenAI-compatible local apps and agents |

Apple Silicon Guide

| Machine class | Recommendation |
| --- | --- |
| 32 GB MacBook Pro / Mac Studio | Prefer Q3_K_M for headroom, especially with agentic prompts and other apps open. |
| 48-64 GB Apple Silicon | Prefer Q4_K_M for balance; use Q5_K_M for quality-first local inference. |
| 96 GB+ Apple Silicon | Prefer Q5_K_M; larger native/custom candidates remain separate until runtime gates pass. |
| Agent workloads with large tool context | Measure total wall-clock time; decode-only tok/s hides prefill cost (see the timing sketch below). |
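
One way to capture that total wall-clock for an agent-sized prompt (long-agent-prompt.txt is a placeholder for your own tool/system context dump):

# `time` reports end-to-end wall-clock, so prefill cost is not hidden
time ./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 -fa -c 8192 \
  -f long-agent-prompt.txt \
  -n 256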

Benchmarks

| Variant | Test | Throughput (t/s) | Backend | Size |
| --- | --- | --- | --- | --- |
| Q3_K_M | pp8192 | 107.09 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q3_K_M | tg128 | 10.19 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q4_K_M | pp8192 | 106.98 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q4_K_M | tg128 | 9.62 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q5_K_M | pp8192 | 93.94 +/- 0.00 | MTL,BLAS | 19.91 GiB |
| Q5_K_M | tg128 | 8.87 +/- 0.00 | MTL,BLAS | 19.91 GiB |

Runtime frontier

Prefill decode tradeoff

Release scorecard

The plots compare the quantized OTQ artifacts against each other on measured release data. Official Qwen scores are kept as a reference table rather than being plotted as if they were measured deltas.

Practical Mini-Subset Quality Signals

See Paired BF16-vs-GGUF Quality Signal. The table and chart are placed near the top of this card because they are the main same-subset quantization-regression evidence.

Release Evaluation

| Variant | Suite | Passed | Pass rate | Mean latency | p95 latency |
| --- | --- | --- | --- | --- | --- |
| Q3_K_M | smoke | 5/5 | 1.0 | 7.605s | 22.371s |
| Q3_K_M | release | 10/10 | 1.0 | 9.325s | 26.905s |
| Q4_K_M | smoke | 5/5 | 1.0 | 8.333s | 23.826s |
| Q4_K_M | release | 10/10 | 1.0 | 9.907s | 21.395s |
| Q5_K_M | smoke | 5/5 | 1.0 | 16.046s | 34.387s |
| Q5_K_M | release | 10/10 | 1.0 | 16.955s | 34.580s |

Release Gate

| Variant | Metadata | Bounded generation | 8K llama-bench | Smoke gate | Release gate | Timestamp |
| --- | --- | --- | --- | --- | --- | --- |
| Q3_K_M | passed | passed (24.246s) | passed (91.371s) | 5/5 | 10/10 | 2026-04-27T19:38:50.320253+00:00 |
| Q4_K_M | passed | passed (22.348s) | passed (93.163s) | 5/5 | 10/10 | 2026-04-27T19:43:25.174228+00:00 |
| Q5_K_M | passed | passed (44.272s) | passed (119.964s) | 5/5 | 10/10 | 2026-04-28T23:18:17.700281+00:00 |

Release gate latency

Release gate coverage

Official Baseline vs OTQ Claims

| Item | Status |
| --- | --- |
| Official Qwen3.6-27B source scores | Imported from the official model card into benchmarks/official_qwen36_baseline.csv |
| OTQ Q3_K_M / Q4_K_M / Q5_K_M runtime | Measured with llama-bench on M1 Max |
| OTQ functional release gates | Measured with deterministic smoke and extended suites |
| Official benchmark deltas | Not claimed yet; requires running the same tasks/scoring on the GGUF artifacts |

Transparency Files

Each variant has full release evidence under evidence/<quant>/:

  • validation.json
  • quality-eval.json
  • release-eval.json
  • opentq-plan.json
  • tensor-types.txt
  • tensor-types.annotated.tsv
  • quantize-dry-run.log
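
To pull the evidence alongside the weights, a sketch assuming the hf CLI's include-pattern support:

hf download zlaabsi/Qwen3.6-27B-OTQ-GGUF \
  --include "evidence/**" \
  --local-dir models/Qwen3.6-27B-OTQ-GGUF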

Reproduce Release Evidence

git clone https://github.com/zlaabsi/opentq
cd opentq
uv sync
uv run python scripts/stage_qwen36_otq_gguf_repo.py
uv run python scripts/build_qwen36_release_report.py --repo artifacts/hf-gguf-canonical/Qwen3.6-27B-OTQ-GGUF

Run the same style of OTQ release evaluation:

LLAMA_CPP_DIR=/path/to/llama.cpp ./scripts/run_qwen36_otq_eval.sh

Run the long-context benchmark directly:

./build/bin/llama-bench \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa on \
  -p 8192 \
  -n 128 \
  -r 1 \
  --no-warmup