How to use from vLLM
Install vLLM from pip and serve the model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "Summarize NVFP4 quantization in one sentence."
			}
		]
	}'
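
The same endpoint can also be called from Python with the OpenAI client. A minimal sketch, assuming the openai package is installed and the server was started as above (the prompt text is illustrative only):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; by default any api_key value is accepted.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}],
)
print(resp.choices[0].message.content)
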
Use Docker
docker model run hf.co/lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4
Qwen3.6 35B A3B HauhauCS Uncensored NVFP4

Uncensored Qwen3.6 35B A3B MoE quantized to NVFP4 compressed-tensors for vLLM on NVIDIA Blackwell / RTX 5090.

  • 35B total / 3B active MoE
  • HauhauCS Aggressive uncensored source
  • Conservative NVFP4 profile: linear attention and MTP kept in bf16 for quality
  • NVFP4 W4A4 compressed-tensors
  • ~22 GB on disk (rough size estimate after this list)
  • Runs on one RTX 5090
  • 100K-131K text context target
  • vLLM native loading
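
The ~22 GB figure is consistent with a back-of-the-envelope estimate. A rough sketch; the 2B-parameter split of bf16-kept modules is an assumption, not a measured number:

# Rough NVFP4 checkpoint size estimate (assumed split, not measured)
total_params = 35e9
bf16_kept    = 2e9                      # assumed: lm_head, linear attention, MTP, gates
fp4_params   = total_params - bf16_kept

fp4_bytes    = fp4_params * 0.5         # 4 bits per quantized weight
scale_bytes  = fp4_params / 16          # one FP8 scale per 16-weight NVFP4 group
bf16_bytes   = bf16_kept * 2            # 2 bytes per bf16 weight

print((fp4_bytes + scale_bytes + bf16_bytes) / 1e9)  # ~22.6 -> in the ~22 GB ballpark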

The model files are placed at the repository root so that Hugging Face lists the weights in the download panel and vllm serve can load the repo directly. The repo intentionally ships a single weight set at the root, so a full-repo snapshot download does not pull multiple profile variants.
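
One way to confirm the single-root layout from Python; a small sketch using huggingface_hub (not required for serving):

from huggingface_hub import list_repo_files

# Root-level files only: expect config.json, tokenizer files, and model-*.safetensors.
files = list_repo_files("lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4")
print([f for f in files if "/" not in f])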

Download

hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
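
The same download can be scripted from Python via huggingface_hub; a minimal equivalent of the CLI call above:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4",
    local_dir="./qwen36-35b-a3b-hauhaucs-nvfp4",
)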

vLLM quickstart

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
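
With --enable-auto-tool-choice and the qwen3_coder tool-call parser enabled, the server can emit structured tool calls. A sketch with the OpenAI client; get_weather is a hypothetical tool for illustration, not part of the model or repo:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen36-35b-a3b-hauhaucs-nvfp4",  # matches --served-model-name above
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)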

Local path quickstart (serving the directory downloaded above):

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

Quantization recipe

# llmcompressor imports needed by the recipe below
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# NVFP4 on all Linear layers, with quality-critical modules kept in bf16
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",                       # output head
        "re:.*visual.*",                 # any vision modules
        "re:.*mlp.gate$",                # MoE router gates
        "re:.*mlp.shared_expert_gate$",  # shared-expert gates
        "re:.*linear_attn.*",            # linear attention blocks
        "re:^mtp.*",                     # multi-token-prediction head
    ],
)
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=128,
    moe_calibrate_all_experts=True,  # route calibration tokens through every expert
    pipeline="basic",
)
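
The snippet above references model and ds without defining them. A minimal sketch of the surrounding setup; the bf16 input path, calibration corpus, and output path are all assumptions, not taken from the repo's scripts:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "./qwen36-35b-a3b-hauhaucs-bf16"   # assumed output of step1_convert_qwen36_moe.py

model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)

# Any small chat-style corpus can serve for calibration; ultrachat is one common choice.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:128]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# ...run the oneshot(...) call shown above, then persist the compressed checkpoint:
model.save_pretrained("./qwen36-35b-a3b-hauhaucs-nvfp4", save_compressed=True)
tokenizer.save_pretrained("./qwen36-35b-a3b-hauhaucs-nvfp4")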

Pipeline:

Q8_K_P GGUF -> step1_convert_qwen36_moe.py -> HF bf16 -> step2_quantize_qwen36_moe.py -> NVFP4

Acknowledgments

  • HauhauCS for the uncensored GGUF source
  • Qwen for the base model and MTP weights
  • AEON-7 and RedHatAI, whose conservative quantization approach served as a reference