Qwen3.6-35B-A3B — NVFP4 + FP8 Mixed-Precision Quantization
A mixed-precision quant of Qwen/Qwen3.6-35B-A3B targeting NVIDIA Blackwell (sm_120) hardware FP4 tensor cores. The 256 routed MoE expert projections are NVFP4 (W4A4 group_size 16); full-attention q/k/v/o and the shared expert are FP8 (W8A8 dynamic). Vision tower, DeltaNet linear-attention layers, router gates, embeddings, lm_head, and the 1-layer MTP head are preserved at BF16.
Produced with llm-compressor oneshot() + 64 calibration samples. The MTP head is included as a BF16 shard so vllm --speculative-config.method mtp loads out of the box.
Quality: wikitext-2 perplexity
Identical scoring pipeline across all three models (vLLM 0.19.0, ppl_vllm.py, ctx=512, 581 chunks, 296,472 tokens, kv_cache_dtype=fp8, dtype=bfloat16).
| Model | Format | Disk | PPL | Δ vs BF16 | Rel |
|---|---|---|---|---|---|
| Qwen/Qwen3.6-35B-A3B (source) | BF16 | 67 GB | 8.0481 | n/a | n/a |
| This quant | NVFP4 + FP8 mixed | 24 GB | 8.1939 | +0.1458 | +1.81 % |
| mmangkad/Qwen3.6-35B-A3B-NVFP4 | NVFP4 (pure) | 24 GB | 8.1853 | +0.1372 | +1.70 % |
Both quants land within ~1.8 % of BF16 PPL. This mixed-precision recipe essentially ties pure NVFP4 (+0.10 % relative, well within run-to-run noise) while preserving attention + shared expert at FP8.
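The delta columns follow directly from the raw perplexities. As a quick check, the sketch below recomputes them from the three PPL values in the table; the helper name `ppl_gap` is illustrative only and is not part of `ppl_vllm.py`.

```python
# PPL values copied from the wikitext-2 table in this card.
BF16_PPL = 8.0481
MIXED_PPL = 8.1939   # this quant (NVFP4 + FP8 mixed)
PURE_PPL = 8.1853    # pure NVFP4 quant


def ppl_gap(ppl: float, baseline: float) -> tuple[float, float]:
    """Return (absolute delta, relative delta in percent) versus a baseline."""
    delta = ppl - baseline
    return delta, 100.0 * delta / baseline


for name, ppl in [("mixed", MIXED_PPL), ("pure", PURE_PPL)]:
    delta, rel = ppl_gap(ppl, BF16_PPL)
    print(f"{name} vs BF16: {delta:+.4f} PPL, {rel:+.2f} %")

# Gap between the two quants, measured against the pure-NVFP4 score.
mixed_vs_pure = ppl_gap(MIXED_PPL, PURE_PPL)
print(f"mixed vs pure: {mixed_vs_pure[0]:+.4f} PPL, {mixed_vs_pure[1]:+.2f} %")
```

The mixed-versus-pure gap comes out at under 0.01 PPL, roughly a tenth of a percent.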
Single-stream decode throughput (batch=1)
NVIDIA RTX PRO 6000 Blackwell (sm_120, 96 GB), vLLM 0.19.0, CUDA graphs on, max_model_len=32k, --enforce-eager OFF (critical — see caveats).
| Model | tok/s |
|---|---|
| BF16 source | 170.3 |
| This quant | 163.4 |
| This quant + MTP k=1 | 156.9 (acceptance 90.8 %, mean accepted length 1.91) |
At batch=1 on Blackwell, BF16 is actually slightly faster than NVFP4 for this 3B-active MoE because the bottleneck is activation movement + kernel scheduling, not weight bandwidth. NVFP4's wins show up at larger batch sizes where weight bandwidth dominates; it also opens up serving on ~24 GB GPUs (with short context). MTP speculative decoding works (90.8 % draft acceptance at k=1) but the tiny base-step time at batch=1 means the draft overhead cancels the gain; expect MTP to help at longer contexts or with wider spec windows.
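To see why a 90.8 % acceptance rate still yields no speedup, it helps to back out per-step times from the table. The sketch below is a rough model, not a measurement: it assumes the reported "mean accepted length" is the number of tokens emitted per verification step (one target token plus accepted drafts), and that throughput is simply tokens per step divided by step time.

```python
# Figures copied from the throughput table in this card.
BASE_TOK_S = 163.4   # this quant, no speculation
MTP_TOK_S = 156.9    # this quant + MTP, k=1
ACCEPTANCE = 0.908   # draft acceptance rate at k=1
K = 1                # draft tokens proposed per step


def expected_tokens_per_step(acceptance: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes every draft position is accepted with the same probability and
    that drafting stops at the first rejection; the target model always
    contributes one token of its own.
    """
    return 1.0 + sum(acceptance ** i for i in range(1, k + 1))


tokens_per_step = expected_tokens_per_step(ACCEPTANCE, K)

# Implied wall-clock time per step under the simple throughput model.
base_step_ms = 1000.0 / BASE_TOK_S                   # one token per step
mtp_step_ms = 1000.0 * tokens_per_step / MTP_TOK_S   # several tokens per step
step_cost_ratio = mtp_step_ms / base_step_ms

print(f"tokens per step with MTP: {tokens_per_step:.3f}")
print(f"step time: {base_step_ms:.2f} ms plain, {mtp_step_ms:.2f} ms with MTP")
print(f"MTP step costs {step_cost_ratio:.2f}x a plain step")
```

Under these assumptions MTP only pays off when the step-cost ratio is below the tokens-per-step figure. Here the ratio is slightly above it, which matches the small throughput loss in the table.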
Hardware NVFP4 path
vLLM backend selection at load:
INFO nvfp4.py:256 Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends:
['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN']
Kernels dispatched are literal cutlass::arch::Sm120 FP4 grouped GEMMs (__nv_fp4_e2m1 inputs, __nv_bfloat16 accumulator/output), compiled under the 120a architecture-specific target. FP8 attention + shared expert use Blackwell FP8 tensor cores. No CPU fallback.
Usage
vllm serve <local-or-hf-path> \
--trust-remote-code \
--max-model-len 32768
# With MTP speculative decoding (1 draft token):
vllm serve <local-or-hf-path> \
--trust-remote-code \
--max-model-len 32768 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1,"model":"<same-path>"}'
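Once a server started with one of the commands above is running, it can be queried over vLLM's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch. It assumes the default `http://localhost:8000` address and that the model name in the request equals the path passed to `vllm serve`; adjust both if the server was started with a different port or served model name. The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(
    model: str,
    prompt: str,
    base_url: str = "http://localhost:8000",  # assumed default vLLM address
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat(build_chat_request("Infatoshi/Qwen3.6-35B-A3B-NVFP4-FP8", "Say hello in one sentence."))` returns the generated text.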
Requirements:
- vLLM >= 0.19 (native Qwen3_5MoeForConditionalGeneration + NVFP4 MoE backends)
- sm_100+ GPU for hardware NVFP4 execution (Blackwell B100/B200, RTX PRO 6000, 5090, etc.). Hopper/Ampere will fall back to Marlin W4A16 dequant and lose the perf advantage.
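The GPU requirement can be expressed as a simple compute-capability check. The sketch below encodes only the rule stated in this card (sm_100 and newer run hardware NVFP4, older parts fall back to Marlin W4A16). The function name is illustrative, and vLLM's real backend selection is more involved than this.

```python
def nvfp4_execution_path(capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to the expected NVFP4 path.

    The capability tuple can be obtained with
    torch.cuda.get_device_capability() on a machine with PyTorch and a GPU.
    """
    if capability >= (10, 0):
        return "hardware NVFP4"
    return "Marlin W4A16 fallback"


# sm_120 (for example RTX PRO 6000 Blackwell), sm_90 (Hopper), sm_86 (Ampere).
for cap in [(12, 0), (9, 0), (8, 6)]:
    print(f"sm_{cap[0]}{cap[1]}: {nvfp4_execution_path(cap)}")
```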
Recipe
- NVFP4 (W4A4, group_size=16): 256 routed expert projections per MoE layer:
  - re:.*mlp\.experts\.\d+\.gate_proj$
  - re:.*mlp\.experts\.\d+\.up_proj$
  - re:.*mlp\.experts\.\d+\.down_proj$
- FP8 (W8A8 dynamic): full-attention blocks and shared experts:
  - re:.*self_attn\.(q|k|v|o)_proj$
  - re:.*shared_expert\.(gate|up|down)_proj$
- BF16 (ignored): lm_head, model.embed_tokens, all router gates, 30 linear-attention (DeltaNet) layers, vision tower, MTP head (re:.*mtp.*, re:.*linear_attn.*, re:.*visual.*, re:.*vision_tower.*).
Calibration: 64 samples of an English-dominant text corpus, max_seq_length=2048, moe_calibrate_all_experts=True.
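To make the targeting rules concrete, the sketch below applies the recipe's regular expressions to a few module names and reports which precision each would receive. The module names are hypothetical examples chosen to resemble a Qwen-style checkpoint and are not read from the actual model. The sketch also assumes ignore patterns take precedence over quantization targets and that the `re:` prefix means "match from the start of the name".

```python
import re

# Patterns copied from the recipe, with the "re:" prefix stripped.
IGNORE_EXACT = {"lm_head", "model.embed_tokens"}
IGNORE_PATTERNS = [r".*mtp.*", r".*linear_attn.*", r".*visual.*", r".*vision_tower.*"]
NVFP4_PATTERNS = [
    r".*mlp\.experts\.\d+\.gate_proj$",
    r".*mlp\.experts\.\d+\.up_proj$",
    r".*mlp\.experts\.\d+\.down_proj$",
]
FP8_PATTERNS = [
    r".*self_attn\.(q|k|v|o)_proj$",
    r".*shared_expert\.(gate|up|down)_proj$",
]


def scheme_for(name: str) -> str:
    """Return the precision a module would get under the recipe's patterns."""
    # Assumption: ignored modules are excluded before any target is considered.
    if name in IGNORE_EXACT or any(re.match(p, name) for p in IGNORE_PATTERNS):
        return "BF16"
    if any(re.match(p, name) for p in NVFP4_PATTERNS):
        return "NVFP4"
    if any(re.match(p, name) for p in FP8_PATTERNS):
        return "FP8"
    return "BF16"  # unmatched modules (for example router gates) stay unquantized


# Hypothetical module names, for illustration only.
examples = [
    "model.language_model.layers.3.mlp.experts.17.gate_proj",
    "model.language_model.layers.3.mlp.shared_expert.down_proj",
    "model.language_model.layers.3.self_attn.o_proj",
    "model.language_model.layers.0.linear_attn.in_proj_qkvz",
    "model.language_model.layers.3.mlp.gate",
    "mtp.layers.0.self_attn.q_proj",
    "lm_head",
]
for name in examples:
    print(f"{scheme_for(name):5s}  {name}")
```

Checking the ignore list first matters for names such as the MTP attention projection, which would otherwise match the FP8 attention pattern.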
Caveats
- Subgraph trace coverage during calibration: llm-compressor's FX-based trace_subgraphs reported "Expected 67 subgraphs, but only traced 41" because the hybrid DeltaNet + full-attention layer pattern breaks FX symbolic tracing at some boundaries. Weight quantization is unaffected; dynamic FP8 activation scaling is unaffected. Only NVFP4 W4A4 input_global_scale values in the un-traced subgraphs may have fallen back to conservative defaults. The +0.01 PPL gap vs pure NVFP4 is consistent with this.
- --enforce-eager drops tok/s ~4x. You will see ~40 tok/s at batch=1 instead of ~165 tok/s. Only use for debugging.
- MTP head is BF16, not NVFP4.
- Multimodal is plumbed but untested. Vision tower is preserved at BF16 so capability is not destroyed, but no VQA benchmark was run.
Known sm_120 gotchas
Some FlashInfer CUTLASS MoE autotune tactics (M128_BS_group2, M256_BS_group0) fail to initialize on sm_120 and get skipped (stderr spam at load). Non-fatal — the autotuner falls through to valid tactics. The FLASHINFER_TRTLLM backend is not sm_120-compatible (tcgen05/TMEM instructions); the auto-selector correctly lands on FLASHINFER_CUTLASS instead.
Build notes for anyone reproducing
Two gotchas surfaced while building this artifact, worth knowing if you try a similar recipe on a 256-expert MoE:
- MoE-unfuse peak memory in llm-compressor. The calibration step that clones each of the 256 fused 3D expert tensors into per-expert nn.Linear modules does not free the original fused tensors afterwards, which doubles peak VRAM and OOMs on a 96 GB GPU. A small local patch to free the fused originals post-clone is all that's needed.
- MTP head not saved by oneshot(save_compressed=True). AutoModelForImageTextToText does not instantiate the MTP head (it's a training-auxiliary submodule), so the 19 MTP tensors never enter the state_dict and are silently dropped. Fix by merging them back from the BF16 source as a second safetensors shard, and adding a matching re:.*mtp.* entry to quantization_config.ignore so vLLM's compressed-tensors loader treats them as BF16 passthrough.
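The bookkeeping side of the MTP fix can be sketched with plain dictionaries. This is not the script used to build the artifact: the tensor names and shard filename below are hypothetical, the index layout assumed is the usual `weight_map` of a sharded safetensors checkpoint, and copying the tensor data itself (reading the BF16 source and writing the extra shard) is left out.

```python
import re


def select_mtp_tensors(source_weight_map: dict[str, str]) -> list[str]:
    """Pick tensor names from the BF16 source that belong to the MTP head."""
    return sorted(n for n in source_weight_map if re.match(r".*mtp.*", n))


def merge_into_index(quant_index: dict, names: list[str], shard_name: str) -> dict:
    """Return a copy of the quantized index with the MTP tensors mapped to a new shard."""
    weight_map = dict(quant_index["weight_map"])
    for name in names:
        weight_map.setdefault(name, shard_name)  # never overwrite existing entries
    return {**quant_index, "weight_map": weight_map}


def ensure_ignored(quant_config: dict, pattern: str = "re:.*mtp.*") -> dict:
    """Return a copy of quantization_config whose ignore list contains the pattern."""
    ignore = list(quant_config.get("ignore", []))
    if pattern not in ignore:
        ignore.append(pattern)
    return {**quant_config, "ignore": ignore}


# Toy inputs with hypothetical names, for illustration only.
source_map = {
    "mtp.fc.weight": "model-00014-of-00014.safetensors",
    "mtp.norm.weight": "model-00014-of-00014.safetensors",
    "lm_head.weight": "model-00014-of-00014.safetensors",
}
quant_index = {"metadata": {}, "weight_map": {"lm_head.weight": "model.safetensors"}}

mtp_names = select_mtp_tensors(source_map)
new_index = merge_into_index(quant_index, mtp_names, "model-mtp.safetensors")
new_config = ensure_ignored({"ignore": ["lm_head"]})
print(mtp_names)
print(new_index["weight_map"])
print(new_config["ignore"])
```

All three helpers return copies rather than mutating their inputs, so the original index and config can be kept for comparison.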
The recipe section above is the full specification — this card is self-contained.