Qwen3.6 35B A3B HauhauCS Uncensored NVFP4

Uncensored Qwen3.6 35B A3B MoE quantized to NVFP4 compressed-tensors for vLLM on NVIDIA Blackwell / RTX 5090.

  • 35B total / 3B active parameters (MoE)
  • Based on the HauhauCS Aggressive uncensored source model
  • Conservative NVFP4 profile: linear-attention and MTP modules kept in bf16 for quality
  • NVFP4 W4A4 in compressed-tensors format
  • ~22 GB of weights
  • Runs on a single RTX 5090
  • Targets 100K-131K tokens of text context
  • Loads natively in vLLM
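The ~22 GB figure is consistent with back-of-the-envelope NVFP4 math. A sketch of the arithmetic; the exact split between quantized and bf16-kept parameters is an assumption, not a published number:

```python
# Rough NVFP4 size estimate for a 35B-parameter model (all figures approximate).
total_params = 35e9

# Assume ~95% of parameters land in NVFP4 Linear layers; the rest
# (embeddings, lm_head, linear-attention and MTP modules) stay in bf16.
quantized = 0.95 * total_params
kept_bf16 = total_params - quantized

packed_gb = quantized * 4 / 8 / 1e9   # 4 bits per packed weight
scales_gb = quantized / 16 / 1e9      # one FP8 scale per 16-weight group
bf16_gb   = kept_bf16 * 2 / 1e9       # 2 bytes per bf16 weight

total_gb = packed_gb + scales_gb + bf16_gb
print(round(total_gb, 1))
```

The per-group FP8 scales (NVFP4 uses groups of 16) add a noticeable ~12% on top of the packed 4-bit weights, which is why the result lands above a naive 17.5 GB estimate.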

The model files are placed at the repository root so Hugging Face shows the weights in the right-side download panel and vllm serve can load the repo directly. The repo intentionally keeps a single root weight set so that a full-repo snapshot download does not pull multiple profile variants.

Download

hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4

vLLM quickstart

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
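Once the server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal stdlib-only sketch of a chat request; the payload is built in a separate helper so it can be inspected without a running server, and localhost:8000 is vLLM's default bind address:

```python
import json
import urllib.request

def build_chat_request(prompt: str) -> dict:
    """Build a /v1/chat/completions payload for the served model name."""
    return {
        "model": "qwen36-35b-a3b-hauhaucs-nvfp4",  # matches --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the server speaks the OpenAI protocol, the official openai Python client pointed at http://localhost:8000/v1 works the same way.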

Local path quickstart:

hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

Quantization recipe

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize every Linear layer to NVFP4, except the modules kept in bf16:
# lm_head, vision towers, MoE router gates, linear-attention blocks, and the MTP head.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"],
)
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=128,
    moe_calibrate_all_experts=True,  # route calibration data through every expert
    pipeline="basic",
)
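llm-compressor treats re:-prefixed ignore entries as regexes over module names. A quick way to sanity-check which layers the patterns catch; the module names below are illustrative shapes, not the model's exact paths, and search-style matching is assumed here:

```python
import re

# The ignore list from the recipe above.
ignore = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$",
          "re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"]

def is_ignored(name: str) -> bool:
    """Exact match for plain entries, regex search for 're:'-prefixed ones."""
    for pat in ignore:
        if pat.startswith("re:"):
            if re.search(pat[3:], name):
                return True
        elif name == pat:
            return True
    return False

print(is_ignored("model.layers.0.mlp.gate"))                 # router gate -> True
print(is_ignored("model.layers.0.mlp.experts.7.gate_proj"))  # expert weight -> False
```

Note the trailing $ in .*mlp.gate$: it keeps the MoE router gates in bf16 while still letting the experts' gate_proj weights be quantized.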

Pipeline:

Q8_K_P GGUF -> step1_convert_qwen36_moe.py -> HF bf16 -> step2_quantize_qwen36_moe.py -> NVFP4

Acknowledgments

  • HauhauCS for the uncensored GGUF source
  • Qwen for the base model and MTP weights
  • AEON-7 and RedHatAI for the conservative quantization approach used as reference