Huihui-Qwen3.6-27B-abliterated-int4-AutoRound


INT4 AutoRound quantization of huihui-ai/Huihui-Qwen3.6-27B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with Marlin INT4 kernel acceleration.

Reference: Lorbus/Qwen3.6-27B-int4-AutoRound — same recipe (mtp.fc dequantized to BF16 for native vLLM MTP support).

Model Details

| Item | Value |
|---|---|
| Architecture | Dense (27B), 64 layers: 16 × (3 × Gated DeltaNet → FFN + 1 × Gated Attention → FFN) |
| Base model | Qwen/Qwen3.6-27B |
| Fine-tuned by | huihui-ai (abliteration, no TransformerLens) |
| Quantized by | YuYu1015 |
| Model size | ~19 GB (vs. ~55.6 GB for the BF16 original) |
| Context length | Up to 262,144 tokens |
| Thinking mode | Supported (enable_thinking: true/false) |
| Tool calling | Supported (qwen3_coder parser) |
| MTP | Built-in MTP weights preserved (qwen3_next_mtp method) |
| DFlash | Compatible with z-lab/Qwen3.6-27B-DFlash |
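
To make the layer arithmetic concrete, the hybrid schedule expands as in this short sketch (the labels are descriptive only, not actual config keys):

    # Illustrative sketch of the 64-layer hybrid schedule described above;
    # labels are hypothetical names for the two layer types, not config keys.
    pattern = (["gated_deltanet"] * 3 + ["gated_attention"]) * 16
    assert len(pattern) == 64  # 16 blocks of (3 linear-attention + 1 full-attention) layers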

Quantization Details

| Item | Value |
|---|---|
| Method | Intel AutoRound v0.12.2 |
| Bits | 4 |
| Group size | 128 |
| Symmetric | Yes |
| Format | auto_round:auto_gptq (GPTQ-compatible) |
| Iterations | 200 |
| Calibration dataset | NeelNanda/pile-10k (auto-round default) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Torch compile | Enabled (--enable_torch_compile) |
| Hardware | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
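
For reproducibility, the table above maps onto roughly the following AutoRound call. This is a sketch rather than the exact script used; in particular, the enable_torch_compile kwarg name is assumed from the CLI flag and may differ across AutoRound versions:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    src = "huihui-ai/Huihui-Qwen3.6-27B-abliterated"
    model = AutoModelForCausalLM.from_pretrained(src, torch_dtype="bfloat16")
    tokenizer = AutoTokenizer.from_pretrained(src)

    ar = AutoRound(
        model, tokenizer,
        bits=4, group_size=128, sym=True,       # W4 G128, symmetric
        iters=200, nsamples=512, seqlen=2048,   # tuning iterations and calibration budget
        dataset="NeelNanda/pile-10k",           # auto-round default calibration set
        enable_torch_compile=True,              # kwarg name assumed from --enable_torch_compile
    )
    ar.quantize()
    ar.save_quantized("./Huihui-Qwen3.6-27B-abliterated-int4", format="auto_gptq")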

Layers Preserved in BF16

The following layers are kept in BF16 to preserve model quality:

| Layer | Reason |
|---|---|
| lm_head | Output head, sensitive to quantization noise (auto-excluded by shape) |
| embed_tokens | Input embeddings (auto-excluded by shape) |
| linear_attn.in_proj_a/b | Gated DeltaNet low-rank projections; shapes not divisible by 32 (auto-skipped) |
| linear_attn.* (other) | Remaining GDN/DeltaNet layers; may output zeros if quantized |
| mtp.fc | Multi-Token Prediction projection; quantized, then dequantized back to BF16 |

Why dequantize mtp.fc? vLLM's Qwen3_5MTP loader expects unquantized weights for the MTP projection layer. AutoRound quantizes it during the main pass; we then convert it back to BF16 in a post-processing step so MTP speculative decoding works natively.
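
A minimal sketch of that post-processing step, assuming classic GPTQ packing (eight 4-bit codes per int32 along the input dimension, no act-order reordering) and hypothetical file and tensor-key names; real checkpoints may differ in shard layout and zero-point encoding, and the checkpoint's quantization_config must also be updated to list the layer as excluded:

    import torch
    from safetensors.torch import load_file, save_file

    path = "model.safetensors"   # shard holding the MTP projection (name assumed)
    prefix = "model.mtp.fc"      # tensor-key prefix (assumed; check your checkpoint)
    state = load_file(path)

    qweight = state.pop(f"{prefix}.qweight")   # int32, [in_features // 8, out_features]
    scales = state.pop(f"{prefix}.scales")     # [in_features // group_size, out_features]
    state.pop(f"{prefix}.qzeros", None)        # symmetric: zero point is the mid-code 8
    state.pop(f"{prefix}.g_idx", None)         # unused without act-order

    # Unpack eight 4-bit codes from each int32, lowest nibble first.
    shifts = torch.arange(0, 32, 4, dtype=torch.int32)
    codes = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    codes = codes.reshape(-1, qweight.shape[1])   # [in_features, out_features]

    # Dequantize: (code - zero_point) * per-group scale, group_size = 128.
    w = (codes.float() - 8.0) * scales.float().repeat_interleave(128, dim=0)

    # nn.Linear stores weight as [out_features, in_features].
    state[f"{prefix}.weight"] = w.t().contiguous().to(torch.bfloat16)
    save_file(state, path)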

Speculative Decoding

This model supports two speculative decoding methods:

MTP (uses built-in weights, recommended for low-concurrency):

--speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 1}'

DFlash (requires separate drafter model):

--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}'

Note: On hybrid GDN architectures, MTP may hit a state-rollback bug (vLLM #39273) at high token-rejection rates. Reduce num_speculative_tokens to 1 if you observe degraded output quality. The DFlash drafter was trained on the original Qwen3.6-27B; acceptance rate on the abliterated variant may be lower.
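
If quality degrades, you can check whether speculation is involved by inspecting the server's Prometheus metrics endpoint. A hedged sketch follows; the exact names of speculative-decoding counters vary across vLLM versions, so it greps broadly:

    import requests

    # Print whatever speculative-decoding counters this vLLM build exposes.
    metrics = requests.get("http://localhost:8000/metrics").text
    for line in metrics.splitlines():
        if "spec_decode" in line and not line.startswith("#"):
            print(line)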

Serving with vLLM

vllm serve /path/to/model \
    --quantization auto_round \
    --served-model-name qwen3.6-27b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --attention-backend flash_attn \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 131072 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 8 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --performance-mode throughput \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
    --trust-remote-code \
    --language-model-only
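
Once the server is running, the OpenAI-compatible endpoint can be exercised as below; a minimal sketch assuming the default host and port, with the thinking-mode toggle passed through chat_template_kwargs:

    from openai import OpenAI

    # Default vLLM endpoint; adjust host/port to your deployment.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="qwen3.6-27b",  # must match --served-model-name
        messages=[{"role": "user", "content": "Explain group-wise INT4 quantization in two sentences."}],
        max_tokens=256,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # thinking off
    )
    print(resp.choices[0].message.content)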

DGX Spark (SM121) Compatibility Notes

  • Use --quantization auto_round for Marlin INT4 kernel (SM121 compatible via SM120 binary compat)
  • FP8 KV cache is not compatible with GDN non-causal attention layers; use --kv-cache-dtype auto
  • NVFP4 falls back to Marlin W4A16 on SM121 (missing cvt.e2m1x2 PTX instruction) — INT4 AutoRound has no such fallback issue
  • Runtime FP8 (--quantization fp8) is not compatible with DFlash (drafter inherits FP8 config and crashes)
  • Dense 27B is bandwidth-bound on DGX Spark's 273 GB/s LPDDR5X; expect ~15-20 tok/s without speculative decoding (see the back-of-envelope estimate after this list)
  • --language-model-only skips vision encoder profiling for text-only inference
  • --performance-mode throughput enables CUDA graphs and selects kernels tuned for throughput
  • Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
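
A rough sanity check on the bandwidth bullet above: each decode step of a dense model streams the weights once, so per-token latency is bounded below by weight bytes divided by memory bandwidth. Taking the full ~19 GB checkpoint at 273 GB/s gives about 70 ms per token, i.e. ~14 tok/s; since not every byte is read on every step and caches absorb some traffic, observed throughput lands slightly above that, in the quoted ~15-20 tok/s range.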

Performance Note

Dense 27B activates all parameters per token (vs MoE 35B-A3B's 3B active), making it more accurate but slower than the 35B-A3B variant on the same hardware. Use this checkpoint when quality matters more than throughput; use 35B-A3B for high-throughput multi-user deployments.
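
As rough arithmetic, per-token weight traffic scales with active parameters: at the same 4-bit precision, 27B active parameters move about 27/3 = 9× more bytes per decode step than the 35B-A3B's 3B active, so the dense model's decode-throughput ceiling is correspondingly ~9× lower on the same memory bus.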

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.



