Qwen3.6-27B-NVFP4
NVFP4 (W4A4) quantization of Qwen/Qwen3.6-27B, produced with NVIDIA Model Optimizer. The vision encoder is preserved in BF16 so multimodal (image + video) capability is intact; only the 27B language backbone is quantized.
- Weights size: 19 GB (vs ~54 GB BF16, ~2.8× compression)
- Hardware target: NVIDIA Blackwell (RTX 5090, B100/B200) — native FP4 tensor cores
- Calibration: mixed image-text + text-only, 128 + 128 samples (see below)
- KV cache: FP8 (E4M3,
fp8_cast— amax set to FP8 range without data-driven calibration of K/V tensors). Matches the convention used bynvidia/Qwen3-32B-NVFP4,nvidia/Qwen3.5-397B-A17B-NVFP4,nvidia/Gemma-4-31B-IT-NVFP4.
What's quantized
| Component | Format |
|---|---|
| Language model linear weights/activations (full + linear-attention transformer blocks) | NVFP4 (W4A4, group size 16, FP8 E4M3 scales) |
| KV cache | FP8 (E4M3) |
Vision encoder (model.visual.*, 27 SigLIP-style blocks) |
BF16 (untouched) |
| Vision-to-LM projector + image/video embeddings | BF16 |
lm_head |
BF16 |
Linear-attention conv1d (48 of 64 layers in this hybrid model) |
BF16 |
MTP head (mtp, mtp.layers.0) |
BF16 |
Routers / mlp.gate.* |
BF16 |
The exclusion list is recorded in hf_quant_config.json (exclude_modules).
Use with vLLM
vllm serve berkerdooo/Qwen3.6-27B-NVFP4 \
--quantization modelopt \
--tensor-parallel-size 2 \
--trust-remote-code
Then send an OpenAI-compatible chat-completions request with an image URL.
NVFP4 inference requires Blackwell hardware and a vLLM build that recognizes the qwen3_5 model type (architecture id is Qwen3_5ForConditionalGeneration).
Mixed calibration
Two distinct dataloaders feed activation statistics into the same NVFP4 quantizers (the calibration algorithm: max takes max(amax_text, amax_image) per block, so order does not matter):
- Image-text pass (128 samples from
nvidia/Nemotron-VLM-Dataset-v2, subsets:sparsetables,plotqa_cot,wiki_en). Real images are processed byAutoProcessor,pixel_valuesare passed throughfull_model.forward(...), exercising the vision encoder, the vision-to-LM projector, and the LM with real projected vision tokens at the head of context. - Text-only pass (128 samples from
abisee/cnn_dailymail). Plain tokenized articles flow through the extractedlanguage_model.forward(...), giving clean amax estimates for channels that vision tokens do not stress.
This combines NVIDIA's text-only convention (used for nvidia/Qwen3.5-397B-A17B-NVFP4) with the multimodal coverage that an image-aware deployment actually exercises at inference time.
Reproducing
# Patched hf_ptq.py to chain --calib_with_images + --dataset.
# When both flags are present, _ChainedCalibLoaders runs the VLM forward
# through full_model and the text forward through language_model.
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1 \
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path Qwen/Qwen3.6-27B \
--qformat nvfp4 \
--calib_with_images \
--dataset cnn_dailymail \
--calib_size 128 \
--batch_size 1 \
--gpu_max_mem_percentage 0.70 \
--trust_remote_code \
--export_path ./qwen3.6-27b-nvfp4
Internally hf_ptq.py calls extract_and_prepare_language_model_from_vl(), which walks the model.model.language_model lineage and attaches a disabled quantizer config to every non-LM submodule (vision tower + projector + embeddings) so the export keeps them in BF16. The _default_disabled_quantizer_cfg (in modelopt/torch/quantization/config.py) already covers *linear_attn.conv1d*, *mixer.conv1d*, *lm_head*, *router*, *output_layer*, BatchNorm, LeakyReLU. MTP layers (mtp, mtp.layers.0) are detected at export time and added to the exclusion list automatically.
NVFP4 format
- 4-bit element: 2-bit mantissa + 1-bit exponent + sign (E2M1), packed
uint8 - 16-element block scale: FP8 E4M3 (
weight_scale) - Per-tensor global scale: FP32 (
weight_scale_2)
Environment used
- Python 3.13
- PyTorch 2.11.0 + CUDA 13.0
- transformers 5.7.0 (the
qwen3_5model type is only recognized by transformers ≥5.0) - nvidia-modelopt (editable, from
Model-Optimizermain, with a small local patch to chain VLM and text calibration loaders) - 2× NVIDIA RTX 5090 (Blackwell, 32 GB each)
License
Inherits Apache 2.0 from the base model.
- Downloads last month
- 48,165
Model tree for berkerdooo/Qwen3.6-27B-NVFP4
Base model
Qwen/Qwen3.6-27B