Qwen3.6-27B-NVFP4

NVFP4 (W4A4) quantization of Qwen/Qwen3.6-27B, produced with NVIDIA Model Optimizer. The vision encoder is preserved in BF16 so multimodal (image + video) capability is intact; only the 27B language backbone is quantized.

  • Weights size: 19 GB (vs ~54 GB BF16, ~2.8× compression)
  • Hardware target: NVIDIA Blackwell (RTX 5090, B100/B200) — native FP4 tensor cores
  • Calibration: mixed image-text + text-only, 128 + 128 samples (see below)
  • KV cache: FP8 (E4M3, fp8_cast — amax set to FP8 range without data-driven calibration of K/V tensors). Matches the convention used by nvidia/Qwen3-32B-NVFP4, nvidia/Qwen3.5-397B-A17B-NVFP4, nvidia/Gemma-4-31B-IT-NVFP4.

What's quantized

Component Format
Language model linear weights/activations (full + linear-attention transformer blocks) NVFP4 (W4A4, group size 16, FP8 E4M3 scales)
KV cache FP8 (E4M3)
Vision encoder (model.visual.*, 27 SigLIP-style blocks) BF16 (untouched)
Vision-to-LM projector + image/video embeddings BF16
lm_head BF16
Linear-attention conv1d (48 of 64 layers in this hybrid model) BF16
MTP head (mtp, mtp.layers.0) BF16
Routers / mlp.gate.* BF16

The exclusion list is recorded in hf_quant_config.json (exclude_modules).

Use with vLLM

vllm serve berkerdooo/Qwen3.6-27B-NVFP4 \
  --quantization modelopt \
  --tensor-parallel-size 2 \
  --trust-remote-code

Then send an OpenAI-compatible chat-completions request with an image URL. NVFP4 inference requires Blackwell hardware and a vLLM build that recognizes the qwen3_5 model type (architecture id is Qwen3_5ForConditionalGeneration).

Mixed calibration

Two distinct dataloaders feed activation statistics into the same NVFP4 quantizers (the calibration algorithm: max takes max(amax_text, amax_image) per block, so order does not matter):

  1. Image-text pass (128 samples from nvidia/Nemotron-VLM-Dataset-v2, subsets: sparsetables, plotqa_cot, wiki_en). Real images are processed by AutoProcessor, pixel_values are passed through full_model.forward(...), exercising the vision encoder, the vision-to-LM projector, and the LM with real projected vision tokens at the head of context.
  2. Text-only pass (128 samples from abisee/cnn_dailymail). Plain tokenized articles flow through the extracted language_model.forward(...), giving clean amax estimates for channels that vision tokens do not stress.

This combines NVIDIA's text-only convention (used for nvidia/Qwen3.5-397B-A17B-NVFP4) with the multimodal coverage that an image-aware deployment actually exercises at inference time.

Reproducing

# Patched hf_ptq.py to chain --calib_with_images + --dataset.
# When both flags are present, _ChainedCalibLoaders runs the VLM forward
# through full_model and the text forward through language_model.

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1 \
python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path Qwen/Qwen3.6-27B \
  --qformat nvfp4 \
  --calib_with_images \
  --dataset cnn_dailymail \
  --calib_size 128 \
  --batch_size 1 \
  --gpu_max_mem_percentage 0.70 \
  --trust_remote_code \
  --export_path ./qwen3.6-27b-nvfp4

Internally hf_ptq.py calls extract_and_prepare_language_model_from_vl(), which walks the model.model.language_model lineage and attaches a disabled quantizer config to every non-LM submodule (vision tower + projector + embeddings) so the export keeps them in BF16. The _default_disabled_quantizer_cfg (in modelopt/torch/quantization/config.py) already covers *linear_attn.conv1d*, *mixer.conv1d*, *lm_head*, *router*, *output_layer*, BatchNorm, LeakyReLU. MTP layers (mtp, mtp.layers.0) are detected at export time and added to the exclusion list automatically.

NVFP4 format

  • 4-bit element: 2-bit mantissa + 1-bit exponent + sign (E2M1), packed uint8
  • 16-element block scale: FP8 E4M3 (weight_scale)
  • Per-tensor global scale: FP32 (weight_scale_2)

Environment used

  • Python 3.13
  • PyTorch 2.11.0 + CUDA 13.0
  • transformers 5.7.0 (the qwen3_5 model type is only recognized by transformers ≥5.0)
  • nvidia-modelopt (editable, from Model-Optimizer main, with a small local patch to chain VLM and text calibration loaders)
  • 2× NVIDIA RTX 5090 (Blackwell, 32 GB each)

License

Inherits Apache 2.0 from the base model.

Downloads last month
48,165
Safetensors
Model size
15B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for berkerdooo/Qwen3.6-27B-NVFP4

Base model

Qwen/Qwen3.6-27B
Quantized
(416)
this model