Qwen3.6-27B-LNARIZE-NVFP4

⚡ Two headline numbers, one GPU

1 GPU holds ~1M tokens of KV state — 4 concurrent streams × 256K tokens each (= 1,025,988 tokens of cached KV) on a single RTX PRO 6000 Blackwell 96 GB. The same setup without KVTC OOMs at engine init.

And 120 tok/s peak single-stream decode — measured at 123.8 tok/s on a coding prompt via the bundled Docker OpenAI-compatible server (KVTC + MTP n=3 + cudagraph); ~95-122 tok/s sustained across reading-comprehension / coding / math / philosophy benchmarks.

The pitch: Lna-Lab models are fast where it matters · faithful in output · don't eat your VRAM. Above ~150 tokens of context (anything past a one-sentence query), KVTC + MTP + cudagraph is strictly faster than the same model without KVTC — and the freed VRAM lets a single GPU hold contexts that nothing else on the same hardware can.

Verified working today (2026-04-27): the bundled serve/ Docker package has been built end-to-end on Blackwell SM120a (CUDA 12.9.1, vLLM 0.19.1, Lna-Lab KVTC fork) and confirmed serving the OpenAI API at the headline speed. See serve/SERVE.md for the 3-command deploy.

What's LNARIZE?

LNARIZE is Lna-Lab's release line for production inference on Qwen3.6-27B. The single goal: a model that pushes the balance of decode speed and VRAM usage at the prompt sizes you actually use — not at synthetic short-prompt micro-benchmarks, but where real chat history, RAG context, multi-turn conversation, and document-scale inputs live (1K-32K tokens).

Each LNARIZE release bundles three production techniques that compose well on Blackwell-class GPUs:

  1. NVFP4 weights (NVIDIA ModelOpt 0.43.0) — 4× smaller than bf16, fast on SM120 tensor cores.
  2. MTP (Multi-Token Prediction) speculative decoding — n=3 draft head grafted into the main model, ~80% acceptance.
  3. KVTC (KV-cache Tensor Compression) — PCA-based per-head KV compression from NVIDIA's ICLR 2026 paper, integrated via the Lna-Lab fork of OnlyTerp/kvtc with the 11 patches needed to make it run cleanly on vLLM 0.19 V1 + cudagraph.

The headline finding from integrating all three: at the production prompt sizes that matter, KVTC + MTP + cudagraph is strictly faster than the same model without KVTC, while the compressed K/V uses 2-3× less VRAM in the middle range. That's the speed × VRAM balance LNARIZE is aiming for.

Where the value shows

LNARIZE is not built to win the 60-token toy bench — it's built to win the 1K-16K production bench.

| Prompt size | Where this lives | KVTC vs baseline (single GPU, MTP n=3, cudagraph) |
|---|---|---|
| < 130 tokens | one-shot probes, "hello" tests | -7% (the only "slow zone"; the KVTC wrapper has no compression to amortize) |
| 163 tokens | first message of a chat | +79% ← the inversion already starts here |
| ~700 tokens | typical chat with history, system prompt + first turn | +63% |
| 4011 tokens (×4 in-flight) | RAG context, code review snippets | +47% |
| 8081 tokens (×4 in-flight) | long document QA, multi-turn agents | +25% |
| 16147 tokens (×4 in-flight) | full conversation history, file-level code | +18% |

(Single GPU RTX PRO 6000 Blackwell SM120, vLLM 0.19.1rc1, full bench data: JetQuant/KVTC/benchmarks.)

The KVTC compute cost (PCA inverse + RoPE) is more than offset by reading 2-3× less K/V from HBM at memory-bandwidth-bound context lengths. The wider the context, the bigger the saving in bytes; the more concurrent contexts you fit (max-num-seqs ↑), the higher the aggregate throughput per GPU.
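To make the bandwidth argument concrete, here is a back-of-the-envelope sketch of the bytes involved. The layout below is an assumption for illustration only — it treats just the 16 full-attention layers (head_dim=256, 4 KV heads, bf16) as holding cached K/V and uses a nominal 2.5× KVTC reduction; the real engine accounting differs, but the scaling is the point.

```python
# Back-of-the-envelope KV-cache sizing (illustrative only; assumes bf16 K/V and that
# only the 16 full-attention layers cache KV -- real engine accounting differs).
def kv_bytes_per_token(layers=16, kv_heads=4, head_dim=256, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem   # 2 = K and V

per_token = kv_bytes_per_token()                               # 65,536 B = 64 KiB per token
for ctx in (700, 4011, 16147, 4 * 262_144):
    raw_gib = ctx * per_token / 2**30
    print(f"{ctx:>9} tokens: {raw_gib:7.2f} GiB raw bf16 KV, "
          f"~{raw_gib / 2.5:6.2f} GiB at an assumed 2.5x KVTC reduction")
```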

What this model is

| Component | Detail |
|---|---|
| Base lineage | huihui-ai/Huihui-Qwen3.6-27B-abliterated (heretic / ara / abliterated lineage of Qwen/Qwen3.6-27B) |
| Architecture | Qwen3_5ForConditionalGeneration — VLM (vision tower preserved), 64 layers / 16 full-attention / head_dim=256 / 4 KV heads, rope_theta=10M |
| Quantization | NVFP4 via NVIDIA ModelOpt 0.43.0 (Blackwell SM120-tuned). Same packing as the immediate parent sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP |
| Speculative decoding | MTP head grafted, recommended num_speculative_tokens=3 |
| KV-cache compression | KVTC K2V4 — calibration shipped in this repo as kvtc_calibration.pt (128 entries, K2V4, 34 MB). Validated as reusable across the entire Qwen3.6-27B fine-tune family. |
| Disk footprint | 20.5 GB safetensors + 34 MB calibration |

The vision tower is preserved in this build (language_model_only: false in config.json); pass --language-model-only at vLLM launch if you want pure-text inference that skips the multimodal preprocessor.

Use it — beginner-friendly walkthrough

This section assumes you have never run a vLLM model with KVTC before. Follow steps 1–4 in order and you'll have LNARIZE generating output in under 10 minutes.

What's in this repo (file-by-file)

| File | What it is | Do you need to touch it? |
|---|---|---|
| model.safetensors (20.5 GB) | NVFP4-quantized weights + MTP draft head | No — vLLM loads it automatically |
| kvtc_calibration.pt (34 MB) | PCA basis & quantization scales for KVTC | No — hf_hub_download pulls it for you |
| config.json | Architecture description (Qwen3_5ForConditionalGeneration) | No |
| hf_quant_config.json | NVFP4 packing spec | No |
| tokenizer*.json / chat_template.jinja | Standard Qwen3 tokenizer assets | No |
| preprocessor_config.json / video_preprocessor_config.json | Image / video preprocessor (vision tower) | Only if you want to feed images |
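None of these files need manual handling for the walkthrough below, but if you want the whole repo on disk up front (e.g. for an air-gapped box), a minimal pre-download sketch with huggingface_hub:

```python
# Pre-download the full repo (weights + calibration + tokenizer assets) to a local directory.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4",
    local_dir="./lnarize-model",          # ~25 GB on disk
)
print("Downloaded to:", local_path)
```

If you go this route, point model= in the runner script at that directory instead of the repo id.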

Prerequisites

| Need | Why | Notes |
|---|---|---|
| NVIDIA Blackwell GPU (RTX PRO 6000 96 GB / B200, etc.) | NVFP4 weights need SM120 tensor cores | Older GPUs (Ada / Hopper) won't load NVFP4 cleanly |
| Python 3.11 or 3.12 | vLLM 0.19+ requirement | |
| vLLM ≥ 0.19.1 | V1 engine + cudagraph compilation | The headline numbers depend on this version |
| CUDA 12.4+ driver | Blackwell NVFP4 support | |
| ~25 GB free disk | Model + calibration | Plus a few GB for the KVTC fork |
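Once Step 1 below has installed vLLM (which brings in torch), you can sanity-check the GPU vLLM will see — a quick sketch that only reads the compute capability, it does not prove the NVFP4 kernels will load:

```python
# Environment sanity check: GPU name, compute capability, and the CUDA build torch sees.
import torch

assert torch.cuda.is_available(), "No CUDA device visible — check drivers / CUDA_VISIBLE_DEVICES"
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {name}  (compute capability {major}.{minor})")
print(f"torch CUDA build: {torch.version.cuda}")
```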

Step 1 — Install vLLM and HF Hub

pip install --upgrade "vllm>=0.19.1" huggingface_hub

Step 2 — Clone the KVTC fork (needs 11 patches not yet in OnlyTerp upstream)

git clone https://github.com/Shinka-Man/kvtc.git ~/kvtc
cd ~/kvtc && ln -sfn src kvtc          # so `from kvtc.X import Y` resolves

(No pip install needed for kvtc — the script below adds ~/kvtc to PYTHONPATH at runtime.)
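A quick way to confirm the fork and symlink are laid out the way the runner expects — a sketch using the same imports the script relies on:

```python
# Verify the KVTC fork resolves from ~/kvtc exactly the way run_lnarize.py will import it.
import os, sys
sys.path.insert(0, os.path.expanduser("~/kvtc"))

from kvtc.calibrate_vllm import VLLMCalibrationCollector  # noqa: F401
from kvtc.vllm_backend import hook_engine                  # noqa: F401
print("kvtc fork imports resolve")
```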

Step 3 — Save the runner script

Save this as run_lnarize.py anywhere on your machine:

"""Drop-in interactive runner for sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4.
Copy-paste, run, get output. KVTC + MTP + cudagraph all wired automatically.
"""
import os, sys

os.environ.setdefault("VLLM_ALLOW_INSECURE_SERIALIZATION", "1")  # KVTC apply_model RPC needs this
sys.path.insert(0, os.path.expanduser("~/kvtc"))                  # KVTC fork lives here

from vllm import LLM, SamplingParams
from huggingface_hub import hf_hub_download
from kvtc.calibrate_vllm import VLLMCalibrationCollector
from kvtc.vllm_backend import hook_engine

REPO = "sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4"

# ---- Engine ----
llm = LLM(
    model=REPO,
    quantization="modelopt",                                       # NVFP4 weights via ModelOpt
    language_model_only=True,                                      # text-only; remove for VLM (image input)
    speculative_config={"method": "qwen3_5_mtp", "num_speculative_tokens": 3},  # MTP n=3
    gpu_memory_utilization=0.85,
    max_model_len=16384,                                           # interactive default; bump to 262144 for long docs
    max_num_seqs=16,
    enable_prefix_caching=False,                                   # required for the current KVTC bootstrap (see note below)
    async_scheduling=False,                                        # same — temporary, removable once the upstream PR lands
    # NB: do NOT pass enforce_eager — cudagraph delivers the headline numbers
)

# ---- KVTC ----
calib_path = hf_hub_download(repo_id=REPO, filename="kvtc_calibration.pt")
cal = VLLMCalibrationCollector.load(calib_path)
cal.sink_tokens = 4
cal.window_tokens = 128
hook_engine(llm, cal, auto_activate=True, use_triton=True)
print(f"[lnarize] KVTC active, calibration: {len(cal.entries)} entries, sink=4 window=128")

# ---- Generate ----
out = llm.generate(
    ["Explain how transformer attention scales with context length, with one concrete example."],
    SamplingParams(max_tokens=400, temperature=0.0),
)
print("\n" + out[0].outputs[0].text)

Step 4 — Run it

export CUDA_VISIBLE_DEVICES=0           # pick any free GPU on your box
python run_lnarize.py

What you should see (rough timing on a single RTX PRO 6000 96 GB Blackwell):

| Phase | Time | What's happening |
|---|---|---|
| 1. vLLM load + compile | ~30 sec | Loading 20 GB safetensors, compiling cudagraph |
| 2. Calibration download (first run only) | ~5 sec | 34 MB kvtc_calibration.pt from HF Hub |
| 3. KVTC hook install | ~1 sec | `[lnarize] KVTC active, calibration: 128 entries, sink=4 window=128` |
| 4. Warmup pass | ~5 sec | vLLM's internal warmup |
| 5. Generate (400 tok decode) | ~3-4 sec | ~95-120 tok/s decode |
| Total wall time | ~50 sec first run, ~15 sec subsequent | cudagraph cache reused after the first run |

If you see KVTC active and ~100 tok/s output, you're good.
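The runner feeds a raw completion prompt; if you'd rather exercise the chat template, vLLM's chat helper works with the same llm object. A small variant — swap it in for the Generate block of run_lnarize.py (sampling values are just examples):

```python
# Optional variant: route the prompt through the chat template instead of raw generate().
from vllm import SamplingParams

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain KV-cache compression in three sentences."},
]
chat_out = llm.chat(messages, SamplingParams(max_tokens=200, temperature=0.0))
print(chat_out[0].outputs[0].text)
```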

Common stumbles

| Symptom | Fix |
|---|---|
| `TypeError: Object of type <class 'function'> is not serializable` | Forgot VLLM_ALLOW_INSECURE_SERIALIZATION=1 (it's set in the script — make sure you didn't unset it) |
| `ModuleNotFoundError: No module named 'kvtc'` | The sys.path.insert line points to your kvtc clone; verify ~/kvtc exists and contains src/ |
| OOM at engine init with default settings | Drop gpu_memory_utilization=0.85 to 0.75, or reduce max_num_seqs |
| Output looks identical to baseline | KVTC may not be engaging at very short prompts (< 130 tok). Try a longer prompt or set sink_tokens=0 window_tokens=0 to force compression |
| `head_dim must be ≤ 128` Triton error | You're hitting the upstream KVTC kernel limit; the Lna-Lab fork dispatches head_dim=256 to the torch fallback automatically — make sure you're on the fork's master, not OnlyTerp's main |

Long-context recipe (up to 1M tokens of KV state on a single GPU)

For batch processing of very long documents (load 4 docs of ~256K tokens each, then decode summaries / Q&A / etc.), change just two lines in the recipe above:

    max_model_len=262144,        # 256K per stream (model native max)
    max_num_seqs=4,              # 4 concurrent × 256K = ~1M tokens of KV state held simultaneously

…and bump gpu_memory_utilization=0.92 to give the cache pool more headroom.
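For reference, here is the full engine block from run_lnarize.py with those three values changed (everything else stays as in Step 3):

```python
# Long-context engine config: 4 concurrent 256K streams on one 96 GB Blackwell GPU.
llm = LLM(
    model=REPO,
    quantization="modelopt",
    language_model_only=True,
    speculative_config={"method": "qwen3_5_mtp", "num_speculative_tokens": 3},
    gpu_memory_utilization=0.92,          # more headroom for the KV-cache pool
    max_model_len=262144,                 # 256K per stream (model native max)
    max_num_seqs=4,                       # 4 × 256K ≈ 1M tokens of KV state held at once
    enable_prefix_caching=False,
    async_scheduling=False,
)
```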

| Setup | Behaviour on RTX PRO 6000 96 GB |
|---|---|
| KVTC + MTP + cudagraph, max_num_seqs=4, 256K each | Fits — generates output successfully |
| Same without KVTC (baseline + MTP) | OOMs at engine init — baseline cannot reserve 1M tokens of KV cache on 96 GB |

Important caveat: the first-token latency for a 256K-token prefill is on the order of several minutes per stream (this is true of any inference engine at this size — it's the cost of the prefill compute, not KVTC overhead). Use this mode for batch document processing, not interactive chat. For interactive chat, stick to the 16K default; KVTC is consistently faster than baseline above ~150-token prompts in that regime.

vLLM serve (OpenAI-compatible server) — bundled in this repo

The serve/ subfolder ships a drop-in OpenAI-compatible server with KVTC + MTP + cudagraph all wired in. Three commands and you have a localhost:9000/v1/chat/completions endpoint:

huggingface-cli download sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4 \
  --include 'serve/*' --local-dir ./lnarize-serve
cd lnarize-serve/serve
docker compose up
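Once the container reports healthy, any OpenAI-compatible client can talk to it. A minimal sketch with the openai Python package — the served model name is assumed here to match the repo id, and the endpoint is assumed to run without auth; confirm both against serve/SERVE.md or GET /v1/models:

```python
# Minimal client call against the bundled server (default port 9000 per the compose setup above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")  # assumption: no auth by default
resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4",   # assumption: repo id as served model name
    messages=[{"role": "user", "content": "Summarize this repo's KVTC approach in two sentences."}],
    max_tokens=200,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```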

For long-context "1M held" mode, set 3 env vars:

LNARIZE_MAX_MODEL_LEN=262144 LNARIZE_MAX_NUM_SEQS=4 LNARIZE_GPU_MEM_UTIL=0.92 \
  docker compose up

Full deployment guide: serve/SERVE.md. Files in serve/:

  • lnarize_serve.py — Python wrapper (drop-in replacement for vllm serve …)
  • Dockerfile — nvidia/cuda:12.6 + vLLM 0.19.1 + the Lna-Lab KVTC fork + the wrapper
  • entrypoint.sh — env-var → CLI translation
  • docker-compose.yml — one-command deploy with healthcheck + persistent caches
  • SERVE.md — beginner-friendly deployment walkthrough

A note on enable_prefix_caching=False

The recipe above passes enable_prefix_caching=False and async_scheduling=False. These are current bootstrap requirements, not permanent choices:

  • The KVTC hooks are installed via engine_core.collective_rpc("apply_model", …) after engine init; the prefix-cache and async-scheduler interactions with that path haven't been hardened yet.
  • Production serving usually wants prefix caching ON for the throughput uplift. The fix lives in Shinka-Man/kvtc and the open upstream issue OnlyTerp/kvtc#6; both are in active progress.
  • If your workload depends on prefix caching today, you can A/B-toggle KVTC off (--kvtc-disable in serve/) per request mix, or wait for the next iteration.

Tuning knobs

| Knob | Default | Effect |
|---|---|---|
| sink_tokens | 4 | First N tokens kept raw (anchors attention reliably) |
| window_tokens | 128 | Last N tokens kept raw (the "recent context" baseline attends densely over) |
| auto_activate | True | Engage KVTC decode the moment prior cached context exists |
| bit_budget_ratio | 0.25 (K2V4 baked into the calibration) | Set at calibration time; this repo ships K2V4 |

For maximum compression (longest contexts, accept some output divergence): sink=0 window=0 puts everything through PCA. Output stays semantically faithful but argmax may flip a token here and there.

For byte-exact short-prompt parity: auto_activate=False keeps KVTC in capture-only mode (overhead ~10%, no compression engaged) — flip activation manually when a long prompt arrives.
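Both presets map onto the same two lines of the runner script. A sketch of each (pick one per engine; the manual-activation call for capture-only mode is not shown here — see the fork for it):

```python
# Preset A — maximum compression: no raw sink/window, everything goes through PCA.
cal.sink_tokens = 0
cal.window_tokens = 0
hook_engine(llm, cal, auto_activate=True, use_triton=True)

# Preset B — byte-exact short-prompt parity: capture-only (~10% overhead, no compression
# engaged) until you flip activation manually for long prompts.
# cal.sink_tokens = 4
# cal.window_tokens = 128
# hook_engine(llm, cal, auto_activate=False, use_triton=True)
```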

Why this release exists

Lna-Lab spent 2026-04-27 working through the entire integration of NVIDIA's KVTC into vLLM 0.19 V1 + the Qwen3.6-27B family + MTP n=3 + cudagraph. The findings (11 patches across 4 phases, full bench matrix from 60 tokens to 16K context × 1-16 in-flight) are documented in the JetQuant/KVTC hub.

The headline finding: at the prompt sizes production traffic actually lives at, KVTC compression is strictly faster than not having it, and the saved VRAM lets you push concurrency higher for the same hardware budget. LNARIZE-NVFP4 packages that result into a drop-in production model.

Acknowledgments

  • NVIDIA Research for the KVTC paper and the OnlyTerp/kvtc reference implementation.
  • huihui-ai for the abliteration lineage this model builds on.
  • Qwen team for the Qwen3.6-27B base.
  • vLLM team for the V1 engine + cudagraph compilation that lets the integration land at production speed.
  • Lna-Lab internal: full integration story in Shinka-Man/kvtc commits and the JetQuant/KVTC docs hub.

License

Apache 2.0, inherited from the Qwen3.6-27B base.
