These release suites are deterministic guardrails, not substitutes for full academic benchmarks.
Official Qwen Baseline
Official Qwen3.6-27B language benchmark scores are imported as an external reference table in benchmarks/official_qwen36_baseline.csv. These scores are not plotted against OTQ results until the matching benchmark tasks have been run on these GGUF files.
Deltas versus the official Qwen baseline must only be reported for tasks that are actually run on the OTQ artifacts with the same task definition and scoring rule.
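The rule above can be sketched as a small filter that reports a delta only when the task definition and scoring rule match on both sides. This is a minimal illustration, not the repo's tooling: the function name and the keys `score`, `task_definition`, and `scoring_rule` are assumptions and may not match the actual CSV columns, and the numbers are placeholders rather than measured results.

```python
def matched_deltas(official, otq):
    """Return {task: otq_score - official_score} for comparable tasks only.

    Both arguments map a task name to a dict with the hypothetical keys
    'score', 'task_definition', and 'scoring_rule'.
    """
    deltas = {}
    for task, ours in otq.items():
        ref = official.get(task)
        if ref is None:
            continue  # no official entry under this task name
        same_setup = (
            ref["task_definition"] == ours["task_definition"]
            and ref["scoring_rule"] == ours["scoring_rule"]
        )
        if same_setup:
            deltas[task] = ours["score"] - ref["score"]
    return deltas


# Placeholder values for illustration only, not real benchmark scores.
official = {
    "task_a": {"score": 71.5, "task_definition": "5-shot", "scoring_rule": "exact_match"},
    "task_b": {"score": 60.0, "task_definition": "0-shot", "scoring_rule": "exact_match"},
}
otq = {
    "task_a": {"score": 70.0, "task_definition": "5-shot", "scoring_rule": "exact_match"},
    "task_b": {"score": 58.0, "task_definition": "0-shot", "scoring_rule": "contains"},
    "task_c": {"score": 55.0, "task_definition": "0-shot", "scoring_rule": "exact_match"},
}
result = matched_deltas(official, otq)
```

Here only `task_a` yields a delta: `task_b` uses a different scoring rule and `task_c` has no official counterpart, so both are omitted rather than reported as misleading differences.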
OTQ Task Runs
No separate OTQ-only benchmark subset is attached beyond the paired BF16-vs-GGUF mini-subset.
Paired BF16-vs-GGUF Mini-Subset
The staged repo includes a same-task, same-prompt, deterministic 232-sample practical subset comparing the BF16 sidecar against Q3_K_M, Q4_K_M, and Q5_K_M.
This is a quantization-regression signal, not a full official benchmark replacement. Do not compare its small-subset mmlu_pro or gpqa rates directly to the Qwen model-card full-harness scores.
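Because both models answer the same prompts, a paired comparison can count per-sample flips instead of only comparing aggregate rates. The sketch below assumes one pass/fail boolean per sample in the same order for both models; the function name and output keys are hypothetical and are not taken from benchmarks/paired_bf16_quant_summary.csv.

```python
def paired_regression(bf16_pass, quant_pass):
    """Summarize a paired pass/fail comparison over identical samples."""
    if len(bf16_pass) != len(quant_pass):
        raise ValueError("both runs must cover the same samples")
    n = len(bf16_pass)
    if n == 0:
        raise ValueError("at least one sample is required")
    pairs = list(zip(bf16_pass, quant_pass))
    return {
        "n": n,
        "bf16_pass_rate": sum(bf16_pass) / n,
        "quant_pass_rate": sum(quant_pass) / n,
        # BF16 passed but the quantized model failed the same sample.
        "regressions": sum(1 for b, q in pairs if b and not q),
        # The quantized model passed where BF16 failed.
        "improvements": sum(1 for b, q in pairs if q and not b),
    }


# Toy data for illustration only.
summary = paired_regression(
    [True, True, False, True],
    [True, False, False, True],
)
```

Counting flips separately from the net rate change helps distinguish a quantization that breaks specific samples from one whose passes and failures merely shuffle.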
CSV Data
benchmarks/throughput.csv
benchmarks/eval_summary.csv
benchmarks/category_pass_rate.csv
benchmarks/artifacts.csv
benchmarks/tensor_allocation.csv
benchmarks/category_tensor_allocation.csv
benchmarks/official_qwen36_baseline.csv
benchmarks/paired_bf16_quant_summary.csv
benchmarks/paired_bf16_quant_summary.json
benchmarks/paired_bf16_quant_report.md
benchmarks/quant_eval.csv (included only when separate OTQ-only task runs are present)
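A minimal sketch for reading whichever of the CSV files above are present, using only the Python standard library. It makes no assumption about column names and skips missing files, since benchmarks/quant_eval.csv is optional. The function name `load_present_csvs` is hypothetical.

```python
import csv
from pathlib import Path

# CSV file names taken from the list above.
CSV_FILES = [
    "throughput.csv",
    "eval_summary.csv",
    "category_pass_rate.csv",
    "artifacts.csv",
    "tensor_allocation.csv",
    "category_tensor_allocation.csv",
    "official_qwen36_baseline.csv",
    "paired_bf16_quant_summary.csv",
    "quant_eval.csv",  # optional
]


def load_present_csvs(benchmarks_dir):
    """Return {file_name: list of row dicts} for each CSV that exists."""
    tables = {}
    for name in CSV_FILES:
        path = Path(benchmarks_dir) / name
        if not path.is_file():
            continue  # skip files that are absent from this release
        with path.open(newline="", encoding="utf-8") as handle:
            tables[name] = list(csv.DictReader(handle))
    return tables
```

Calling `load_present_csvs("benchmarks")` from the repo root would return one entry per file found; all values are strings as read by `csv.DictReader`, so numeric columns need explicit conversion.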