Qwen3.6-27B-LNARIZE-NVFP4

⚡ Two headline numbers, one GPU

1 GPU holds ~1M tokens of KV state — 4 concurrent streams × 256K tokens each (= 1,025,988 tokens of cached KV) on a single RTX PRO 6000 Blackwell 96 GB. The same setup without KVTC OOMs at engine init.

And 120 tok/s peak single-stream decode — measured at 123.8 tok/s on a coding prompt via the bundled Docker OpenAI-compatible server (KVTC + MTP n=3 + cudagraph); ~95-122 tok/s sustained across reading-comprehension / coding / math / philosophy benchmarks.

The pitch: Lna-Lab models are fast where it matters · faithful in output · don't eat your VRAM. Above ~150 tokens of context (anything past a one-sentence query), KVTC + MTP + cudagraph is strictly faster than the same model without KVTC — and the freed VRAM lets a single GPU hold contexts that nothing else on the same hardware can.

Verified working today (2026-04-27): the bundled serve/ Docker package has been built end-to-end on Blackwell SM120a (CUDA 12.9.1, vLLM 0.19.1, Lna-Lab KVTC fork) and confirmed serving the OpenAI API at the headline speed. See serve/SERVE.md for the 3-command deploy.

What's LNARIZE?

LNARIZE is Lna-Lab's release line for production inference on Qwen3.6-27B. The single goal: a model that pushes the balance of decode speed and VRAM usage at the prompt sizes you actually use — not at synthetic short-prompt micro-benchmarks, but where real chat history, RAG context, multi-turn conversation, and document-scale inputs live (1K-32K tokens).

Each LNARIZE release bundles three production techniques that compose well on Blackwell-class GPUs:

  1. NVFP4 weights (NVIDIA ModelOpt 0.43.0) — 4× smaller than bf16, fast on SM120 tensor cores.
  2. MTP (Multi-Token Prediction) speculative decoding — n=3 draft head grafted into the main model, ~80% acceptance.
  3. KVTC (KV-cache Tensor Compression) — PCA-based per-head KV compression from NVIDIA's ICLR 2026 paper, integrated via the Lna-Lab fork of OnlyTerp/kvtc with the 11 patches needed to make it run cleanly on vLLM 0.19 V1 + cudagraph.

The headline finding from integrating all three: at the production prompt sizes that matter, KVTC + MTP + cudagraph is strictly faster than the same model without KVTC, while the compressed K/V uses 2-3× less VRAM in the middle range. That's the speed × VRAM balance LNARIZE is aiming for.

Where the value shows

LNARIZE is not built to win the 60-token toy bench — it's built to win the 1K-16K production bench.

| Prompt size | Where this lives | KVTC vs baseline (single GPU, MTP n=3, cudagraph) |
|---|---|---|
| < 130 tokens | one-shot probes, "hello" tests | -7% (the only "slow zone"; the KVTC wrapper has no compression to amortize) |
| 163 tokens | first message of a chat | +79% ← the inversion already starts here |
| ~700 tokens | typical chat with history, system prompt + first turn | +63% |
| 4011 tokens (×4 in-flight) | RAG context, code review snippets | +47% |
| 8081 tokens (×4 in-flight) | long document QA, multi-turn agents | +25% |
| 16147 tokens (×4 in-flight) | full conversation history, file-level code | +18% |

(Single GPU RTX PRO 6000 Blackwell SM120, vLLM 0.19.1rc1, full bench data: JetQuant/KVTC/benchmarks.)

The KVTC compute cost (PCA inverse + RoPE) is more than offset by reading 2-3× less K/V from HBM at memory-bandwidth-bound context lengths. The wider the context, the bigger the saving in bytes; the more concurrent contexts you fit (max-num-seqs ↑), the higher the aggregate throughput per GPU.
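To make the bandwidth argument concrete, here is a back-of-the-envelope sketch of the bytes involved. The layout below is an assumption for illustration only — it treats just the 16 full-attention layers (head_dim=256, 4 KV heads, bf16) as holding cached K/V and uses a nominal 2.5× KVTC reduction; the real engine accounting differs, but the scaling is the point.

```python
# Back-of-the-envelope KV-cache sizing (illustrative only; assumes bf16 K/V and that
# only the 16 full-attention layers cache KV -- real engine accounting differs).
def kv_bytes_per_token(layers=16, kv_heads=4, head_dim=256, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem   # 2 = K and V

per_token = kv_bytes_per_token()                               # 65,536 B = 64 KiB per token
for ctx in (700, 4011, 16147, 4 * 262_144):
    raw_gib = ctx * per_token / 2**30
    print(f"{ctx:>9} tokens: {raw_gib:7.2f} GiB raw bf16 KV, "
          f"~{raw_gib / 2.5:6.2f} GiB at an assumed 2.5x KVTC reduction")
```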

What this model is

| Component | Detail |
|---|---|
| Base lineage | huihui-ai/Huihui-Qwen3.6-27B-abliterated (heretic / ara / abliterated lineage of Qwen/Qwen3.6-27B) |
| Architecture | Qwen3_5ForConditionalGeneration — VLM (vision tower preserved), 64 layers / 16 full-attention / head_dim=256 / 4 KV heads, rope_theta=10M |
| Quantization | NVFP4 via NVIDIA ModelOpt 0.43.0 (Blackwell SM120-tuned). Same packing as the immediate parent sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP |
| Speculative decoding | MTP head grafted, recommended num_speculative_tokens=3 |
| KV-cache compression | KVTC K2V4 — calibration shipped in this repo as kvtc_calibration.pt (128 entries, K2V4, 34 MB). Validated as reusable across the entire Qwen3.6-27B fine-tune family. |
| Disk footprint | 20.5 GB safetensors + 34 MB calibration |

The vision tower is preserved in this build (language_model_only: false in config.json); pass --language-model-only at vLLM launch if you want pure-text inference that skips the multimodal preprocessor.

Use it — beginner-friendly walkthrough

This section assumes you have never run a vLLM model with KVTC before. Follow steps 1–4 in order and you'll have LNARIZE generating output in under 10 minutes.

What's in this repo (file-by-file)

| File | What it is | Do you need to touch it? |
|---|---|---|
| model.safetensors (20.5 GB) | NVFP4-quantized weights + MTP draft head | No — vLLM loads it automatically |
| kvtc_calibration.pt (34 MB) | PCA basis & quantization scales for KVTC | No — hf_hub_download pulls it for you |
| config.json | Architecture description (Qwen3_5ForConditionalGeneration) | No |
| hf_quant_config.json | NVFP4 packing spec | No |
| tokenizer*.json / chat_template.jinja | Standard Qwen3 tokenizer assets | No |
| preprocessor_config.json / video_preprocessor_config.json | Image / video preprocessor (vision tower) | Only if you want to feed images |
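None of these files need manual handling for the walkthrough below, but if you want the whole repo on disk up front (e.g. for an air-gapped box), a minimal pre-download sketch with huggingface_hub:

```python
# Pre-download the full repo (weights + calibration + tokenizer assets) to a local directory.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4",
    local_dir="./lnarize-model",          # ~25 GB on disk
)
print("Downloaded to:", local_path)
```

If you go this route, point model= in the runner script at that directory instead of the repo id.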

Prerequisites

| Need | Why | Notes |
|---|---|---|
| NVIDIA Blackwell GPU (RTX PRO 6000 96 GB / B200, etc.) | NVFP4 weights need SM120 tensor cores | Older GPUs (Ada / Hopper) won't load NVFP4 cleanly |
| Python 3.11 or 3.12 | vLLM 0.19+ requirement | |
| vLLM ≥ 0.19.1 | V1 engine + cudagraph compilation | The headline numbers depend on this version |
| CUDA 12.4+ driver | Blackwell NVFP4 support | |
| ~25 GB free disk | Model + calibration | Plus a few GB for the KVTC fork |
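Once Step 1 below has installed vLLM (which brings in torch), you can sanity-check the GPU vLLM will see — a quick sketch that only reads the compute capability, it does not prove the NVFP4 kernels will load:

```python
# Environment sanity check: GPU name, compute capability, and the CUDA build torch sees.
import torch

assert torch.cuda.is_available(), "No CUDA device visible — check drivers / CUDA_VISIBLE_DEVICES"
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {name}  (compute capability {major}.{minor})")
print(f"torch CUDA build: {torch.version.cuda}")
```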

Step 1 — Install vLLM and HF Hub

pip install --upgrade "vllm>=0.19.1" huggingface_hub

Step 2 — Clone the KVTC fork (needs 11 patches not yet in OnlyTerp upstream)

git clone https://github.com/Shinka-Man/kvtc.git ~/kvtc
cd ~/kvtc && ln -sfn src kvtc          # so `from kvtc.X import Y` resolves

(No pip install needed for kvtc — the script below adds ~/kvtc to PYTHONPATH at runtime.)
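A quick way to confirm the fork and symlink are laid out the way the runner expects — a sketch using the same imports the script relies on:

```python
# Verify the KVTC fork resolves from ~/kvtc exactly the way run_lnarize.py will import it.
import os, sys
sys.path.insert(0, os.path.expanduser("~/kvtc"))

from kvtc.calibrate_vllm import VLLMCalibrationCollector  # noqa: F401
from kvtc.vllm_backend import hook_engine                  # noqa: F401
print("kvtc fork imports resolve")
```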

Step 3 — Save the runner script

Save this as run_lnarize.py anywhere on your machine:

"""Drop-in interactive runner for sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4.
Copy-paste, run, get output. KVTC + MTP + cudagraph all wired automatically.
"""
import os, sys

os.environ.setdefault("VLLM_ALLOW_INSECURE_SERIALIZATION", "1")  # KVTC apply_model RPC needs this
sys.path.insert(0, os.path.expanduser("~/kvtc"))                  # KVTC fork lives here

from vllm import LLM, SamplingParams
from huggingface_hub import hf_hub_download
from kvtc.calibrate_vllm import VLLMCalibrationCollector
from kvtc.vllm_backend import hook_engine

REPO = "sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4"

# ---- Engine ----
llm = LLM(
    model=REPO,
    quantization="modelopt",                                       # NVFP4 weights via ModelOpt
    language_model_only=True,                                      # text-only; remove for VLM (image input)
    speculative_config={"method": "qwen3_5_mtp", "num_speculative_tokens": 3},  # MTP n=3
    gpu_memory_utilization=0.85,
    max_model_len=16384,                                           # interactive default; bump to 262144 for long docs
    max_num_seqs=16,
    enable_prefix_caching=False,                                   # required for the current KVTC bootstrap (see note below)
    async_scheduling=False,                                        # same — temporary, removable once the upstream PR lands
    # NB: do NOT pass enforce_eager — cudagraph delivers the headline numbers
)

# ---- KVTC ----
calib_path = hf_hub_download(repo_id=REPO, filename="kvtc_calibration.pt")
cal = VLLMCalibrationCollector.load(calib_path)
cal.sink_tokens = 4
cal.window_tokens = 128
hook_engine(llm, cal, auto_activate=True, use_triton=True)
print(f"[lnarize] KVTC active, calibration: {len(cal.entries)} entries, sink=4 window=128")

# ---- Generate ----
out = llm.generate(
    ["Explain how transformer attention scales with context length, with one concrete example."],
    SamplingParams(max_tokens=400, temperature=0.0),
)
print("\n" + out[0].outputs[0].text)

Step 4 — Run it

export CUDA_VISIBLE_DEVICES=0           # pick any free GPU on your box
python run_lnarize.py

What you should see (rough timing on a single RTX PRO 6000 96 GB Blackwell):

| Phase | Time | What's happening |
|---|---|---|
| 1. vLLM load + compile | ~30 sec | Loading 20 GB safetensors, compiling cudagraph |
| 2. Calibration download (first run only) | ~5 sec | 34 MB kvtc_calibration.pt from HF Hub |
| 3. KVTC hook install | ~1 sec | `[lnarize] KVTC active, calibration: 128 entries, sink=4 window=128` |
| 4. Warmup pass | ~5 sec | vLLM's internal warmup |
| 5. Generate (400 tok decode) | ~3-4 sec | ~95-120 tok/s decode |
| Total wall time | ~50 sec first run, ~15 sec subsequent | cudagraph cache reused after the first run |

If you see KVTC active and ~100 tok/s output, you're good.
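The runner feeds a raw completion prompt; if you'd rather exercise the chat template, vLLM's chat helper works with the same llm object. A small variant — swap it in for the Generate block of run_lnarize.py (sampling values are just examples):

```python
# Optional variant: route the prompt through the chat template instead of raw generate().
from vllm import SamplingParams

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain KV-cache compression in three sentences."},
]
chat_out = llm.chat(messages, SamplingParams(max_tokens=200, temperature=0.0))
print(chat_out[0].outputs[0].text)
```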

Common stumbles

| Symptom | Fix |
|---|---|
| `TypeError: Object of type <class 'function'> is not serializable` | Forgot VLLM_ALLOW_INSECURE_SERIALIZATION=1 (it's set in the script — make sure you didn't unset it) |
| `ModuleNotFoundError: No module named 'kvtc'` | The sys.path.insert line points to your kvtc clone; verify ~/kvtc exists and contains src/ |
| OOM at engine init with default settings | Drop gpu_memory_utilization=0.85 to 0.75, or reduce max_num_seqs |
| Output looks identical to baseline | KVTC may not be engaging at very short prompts (< 130 tok). Try a longer prompt or set sink_tokens=0 window_tokens=0 to force compression |
| `head_dim must be ≤ 128` Triton error | You're hitting the upstream KVTC kernel limit; the Lna-Lab fork dispatches head_dim=256 to the torch fallback automatically — make sure you're on the fork's master, not OnlyTerp's main |

Long-context recipe (up to 1M tokens of KV state on a single GPU)

For batch processing of very long documents (load 4 docs of ~256K tokens each, then decode summaries / Q&A / etc.), change just two lines in the recipe above:

    max_model_len=262144,        # 256K per stream (model native max)
    max_num_seqs=4,              # 4 concurrent × 256K = ~1M tokens of KV state held simultaneously

…and bump gpu_memory_utilization=0.92 to give the cache pool more headroom.
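For reference, here is the full engine block from run_lnarize.py with those three values changed (everything else stays as in Step 3):

```python
# Long-context engine config: 4 concurrent 256K streams on one 96 GB Blackwell GPU.
llm = LLM(
    model=REPO,
    quantization="modelopt",
    language_model_only=True,
    speculative_config={"method": "qwen3_5_mtp", "num_speculative_tokens": 3},
    gpu_memory_utilization=0.92,          # more headroom for the KV-cache pool
    max_model_len=262144,                 # 256K per stream (model native max)
    max_num_seqs=4,                       # 4 × 256K ≈ 1M tokens of KV state held at once
    enable_prefix_caching=False,
    async_scheduling=False,
)
```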

| Setup | Behaviour on RTX PRO 6000 96 GB |
|---|---|
| KVTC + MTP + cudagraph, max_num_seqs=4, 256K each | Fits — generates output successfully |
| Same without KVTC (baseline + MTP) | OOMs at engine init — baseline cannot reserve 1M tokens of KV cache on 96 GB |

Important caveat: the first-token latency for a 256K-token prefill is on the order of several minutes per stream (this is true of any inference engine at this size — it's the cost of the prefill compute, not KVTC overhead). Use this mode for batch document processing, not interactive chat. For interactive chat, stick to the 16K default; KVTC is consistently faster than baseline above ~150-token prompts in that regime.

vLLM serve (OpenAI-compatible server) — bundled in this repo

The serve/ subfolder ships a drop-in OpenAI-compatible server with KVTC + MTP + cudagraph all wired in. Three commands and you have a localhost:9000/v1/chat/completions endpoint:

huggingface-cli download sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4 \
  --include 'serve/*' --local-dir ./lnarize-serve
cd lnarize-serve/serve
docker compose up
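Once the container reports healthy, any OpenAI-compatible client can talk to it. A minimal sketch with the openai Python package — the served model name is assumed here to match the repo id, and the endpoint is assumed to run without auth; confirm both against serve/SERVE.md or GET /v1/models:

```python
# Minimal client call against the bundled server (default port 9000 per the compose setup above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")  # assumption: no auth by default
resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4",   # assumption: repo id as served model name
    messages=[{"role": "user", "content": "Summarize this repo's KVTC approach in two sentences."}],
    max_tokens=200,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```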

For long-context "1M held" mode, set 3 env vars:

LNARIZE_MAX_MODEL_LEN=262144 LNARIZE_MAX_NUM_SEQS=4 LNARIZE_GPU_MEM_UTIL=0.92 \
  docker compose up

Full deployment guide: serve/SERVE.md. Files in serve/:

  • lnarize_serve.py — Python wrapper (drop-in replacement for vllm serve …)
  • Dockerfile — nvidia/cuda:12.6 + vLLM 0.19.1 + the Lna-Lab KVTC fork + the wrapper
  • entrypoint.sh — env-var → CLI translation
  • docker-compose.yml — one-command deploy with healthcheck + persistent caches
  • SERVE.md — beginner-friendly deployment walkthrough

A note on enable_prefix_caching=False

The recipe above passes enable_prefix_caching=False and async_scheduling=False. These are current bootstrap requirements, not permanent choices:

  • The KVTC hooks are installed via engine_core.collective_rpc("apply_model", …) after engine init; the prefix-cache and async-scheduler interactions with that path haven't been hardened yet.
  • Production serving usually wants prefix caching ON for the throughput uplift. The fix lives in Shinka-Man/kvtc and the open upstream issue OnlyTerp/kvtc#6; both are in active progress.
  • If your workload depends on prefix caching today, you can A/B-toggle KVTC off (--kvtc-disable in serve/) per request mix, or wait for the next iteration.

Tuning knobs

| Knob | Default | Effect |
|---|---|---|
| sink_tokens | 4 | First N tokens kept raw (anchors attention reliably) |
| window_tokens | 128 | Last N tokens kept raw (the "recent context" baseline attends densely over) |
| auto_activate | True | Engage KVTC decode the moment prior cached context exists |
| bit_budget_ratio | 0.25 (K2V4 baked into the calibration) | Set at calibration time; this repo ships K2V4 |

For maximum compression (longest contexts, accept some output divergence): sink=0 window=0 puts everything through PCA. Output stays semantically faithful but argmax may flip a token here and there.

For byte-exact short-prompt parity: auto_activate=False keeps KVTC in capture-only mode (overhead ~10%, no compression engaged) — flip activation manually when a long prompt arrives.
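Both presets map onto the same two lines of the runner script. A sketch of each (pick one per engine; the manual-activation call for capture-only mode is not shown here — see the fork for it):

```python
# Preset A — maximum compression: no raw sink/window, everything goes through PCA.
cal.sink_tokens = 0
cal.window_tokens = 0
hook_engine(llm, cal, auto_activate=True, use_triton=True)

# Preset B — byte-exact short-prompt parity: capture-only (~10% overhead, no compression
# engaged) until you flip activation manually for long prompts.
# cal.sink_tokens = 4
# cal.window_tokens = 128
# hook_engine(llm, cal, auto_activate=False, use_triton=True)
```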

Why this release exists

Lna-Lab spent 2026-04-27 working through the entire integration of NVIDIA's KVTC into vLLM 0.19 V1 + the Qwen3.6-27B family + MTP n=3 + cudagraph. The findings (11 patches across 4 phases, full bench matrix from 60 tokens to 16K context × 1-16 in-flight) are documented in the JetQuant/KVTC hub.

The headline finding: at the prompt sizes production traffic actually lives at, KVTC compression is strictly faster than not having it, and the saved VRAM lets you push concurrency higher for the same hardware budget. LNARIZE-NVFP4 packages that result into a drop-in production model.

Acknowledgments

  • NVIDIA Research for the KVTC paper and the OnlyTerp/kvtc reference implementation.
  • huihui-ai for the abliteration lineage this model builds on.
  • Qwen team for the Qwen3.6-27B base.
  • vLLM team for the V1 engine + cudagraph compilation that lets the integration land at production speed.
  • Lna-Lab internal: full integration story in Shinka-Man/kvtc commits and the JetQuant/KVTC docs hub.

License

Apache 2.0, inherited from the Qwen3.6-27B base.
