Qwen3.6-35B-A3B — NVFP4 + FP8 Mixed-Precision Quantization

A mixed-precision quant of Qwen/Qwen3.6-35B-A3B targeting NVIDIA Blackwell (sm_120) hardware FP4 tensor cores. The 256 routed MoE expert projections are NVFP4 (W4A4 group_size 16); full-attention q/k/v/o and the shared expert are FP8 (W8A8 dynamic). Vision tower, DeltaNet linear-attention layers, router gates, embeddings, lm_head, and the 1-layer MTP head are preserved at BF16.

Produced with llm-compressor oneshot() + 64 calibration samples. The MTP head is included as a BF16 shard so vllm --speculative-config.method mtp loads out of the box.

Quality: wikitext-2 perplexity

Identical scoring pipeline across all three models (vLLM 0.19.0, ppl_vllm.py, ctx=512, 581 chunks, 296,472 tokens, kv_cache_dtype=fp8, dtype=bfloat16).

Model Format Disk PPL Δ vs BF16 Rel
Qwen/Qwen3.6-35B-A3B (source) BF16 67 GB 8.0481
This quant NVFP4 + FP8 mixed 24 GB 8.1939 +0.1458 +1.81 %
mmangkad/Qwen3.6-35B-A3B-NVFP4 NVFP4 (pure) 24 GB 8.1853 +0.1372 +1.70 %

Both quants land within ~1.8 % of BF16 PPL. This mixed-precision recipe essentially ties pure NVFP4 (+0.10 % relative, well within run-to-run noise) while preserving attention + shared expert at FP8.

Single-stream decode throughput (batch=1)

NVIDIA RTX PRO 6000 Blackwell (sm_120, 96 GB), vLLM 0.19.0, CUDA graphs on, max_model_len=32k, --enforce-eager OFF (critical — see caveats).

Model tok/s
BF16 source 170.3
This quant 163.4
This quant + MTP k=1 156.9 (acceptance 90.8 %, mean accepted length 1.91)

At batch=1 on Blackwell, BF16 is actually slightly faster than NVFP4 for this 3B-active MoE because the bottleneck is activation movement + kernel scheduling, not weight bandwidth. NVFP4's wins show up at larger batch sizes where weight bandwidth dominates; it also opens up serving on ~24 GB GPUs (with short context). MTP speculative decoding works (90.8 % draft acceptance at k=1) but the tiny base-step time at batch=1 means the draft overhead cancels the gain; expect MTP to help at longer contexts or with wider spec windows.

Hardware NVFP4 path

vLLM backend selection at load:

INFO nvfp4.py:256 Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends:
     ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN']

Kernels dispatched are literal cutlass::arch::Sm120 FP4 grouped GEMMs (__nv_fp4_e2m1 inputs, __nv_bfloat16 accumulator/output), compiled under the 120a architecture-specific target. FP8 attention + shared expert use Blackwell FP8 tensor cores. No CPU fallback.

Usage

vllm serve <local-or-hf-path> \
  --trust-remote-code \
  --max-model-len 32768

# With MTP speculative decoding (1 draft token):
vllm serve <local-or-hf-path> \
  --trust-remote-code \
  --max-model-len 32768 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1,"model":"<same-path>"}'

Requirements:

  • vLLM >= 0.19 (native Qwen3_5MoeForConditionalGeneration + NVFP4 MoE backends)
  • sm_100+ GPU for hardware NVFP4 execution (Blackwell B100/B200, RTX PRO 6000, 5090, etc.). Hopper/Ampere will fall back to Marlin W4A16 dequant and lose the perf advantage.

Recipe

  • NVFP4 (W4A4, group_size=16) — 256 routed expert projections per MoE layer:
    • re:.*mlp\.experts\.\d+\.gate_proj$
    • re:.*mlp\.experts\.\d+\.up_proj$
    • re:.*mlp\.experts\.\d+\.down_proj$
  • FP8 (W8A8 dynamic) — full-attention blocks and shared experts:
    • re:.*self_attn\.(q|k|v|o)_proj$
    • re:.*shared_expert\.(gate|up|down)_proj$
  • BF16 (ignored)lm_head, model.embed_tokens, all router gates, 30 linear-attention (DeltaNet) layers, vision tower, MTP head (re:.*mtp.*, re:.*linear_attn.*, re:.*visual.*, re:.*vision_tower.*).

Calibration: 64 samples of an English-dominant text corpus, max_seq_length=2048, moe_calibrate_all_experts=True.

Caveats

  1. Subgraph trace coverage during calibration — llm-compressor's FX-based trace_subgraphs reported "Expected 67 subgraphs, but only traced 41" because the hybrid DeltaNet + full-attention layer pattern breaks FX symbolic tracing at some boundaries. Weight quantization is unaffected; dynamic FP8 activation scaling is unaffected. Only NVFP4 W4A4 input_global_scale values in the un-traced subgraphs may have fallen back to conservative defaults. The +0.01 PPL gap vs pure NVFP4 is consistent with this.
  2. --enforce-eager drops tok/s ~4x. You will see ~40 tok/s at batch=1 instead of ~165 tok/s. Only use for debugging.
  3. MTP head is BF16, not NVFP4.
  4. Multimodal is plumbed but untested. Vision tower is preserved at BF16 so capability is not destroyed, but no VQA benchmark was run.

Known sm_120 gotchas

Some FlashInfer CUTLASS MoE autotune tactics (M128_BS_group2, M256_BS_group0) fail to initialize on sm_120 and get skipped (stderr spam at load). Non-fatal — the autotuner falls through to valid tactics. The FLASHINFER_TRTLLM backend is not sm_120-compatible (tcgen05/TMEM instructions); the auto-selector correctly lands on FLASHINFER_CUTLASS instead.

Build notes for anyone reproducing

Two gotchas surfaced while building this artifact, worth knowing if you try a similar recipe on a 256-expert MoE:

  1. MoE-unfuse peak memory in llm-compressor. The calibration step that clones each of the 256 fused 3D expert tensors into per-expert nn.Linear modules does not free the original fused tensors afterwards, which doubles peak VRAM and OOMs on a 96 GB GPU. A small local patch to free the fused originals post-clone is all that's needed.
  2. MTP head not saved by oneshot(save_compressed=True). AutoModelForImageTextToText does not instantiate the MTP head (it's a training-auxiliary submodule), so the 19 MTP tensors never enter the state_dict and are silently dropped. Fix by merging them back from the BF16 source as a second safetensors shard, and adding a matching re:.*mtp.* entry to quantization_config.ignore so vLLM's compressed-tensors loader treats them as BF16 passthrough.

The recipe section above is the full specification — this card is self-contained.

Downloads last month
2,495
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Infatoshi/Qwen3.6-35B-A3B-NVFP4-FP8

Quantized
(409)
this model