How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
./llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Use Docker
docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Quick Links

GLM-5.1-Abliterated-Dynamic-IQ3-340-GGUF

GGUF-quantized version of helixdouble/GLM-5.1-Abliterated (base: zai-org/GLM-5.1-FP8).

744B parameters MoE / 79 layers / 256 experts / 40B active / context length 202,752


โš ๏ธ CRITICAL: Known Issue โ€” Garbage Output on CUDA (As of 2026-05-18)

This model currently produces garbage output on all tested CUDA configurations. The build-time "smoke test" only measured token generation speed (tok/s) โ€” output quality was never verified.

Symptom

  • Greedy decoding (temperature=0): always emits token 0 (!) repeatedly: !!!!!!!!!!!!!!!!!!!!
  • Sampling: random character garbage (e.g., H2%(@G&=6#9C4,')

Affected Configurations (all tested, all failed)

GPUs ctx-size Key flags Result
6ร— RTX PRO 6000 Blackwell 4096 -mla 3 -muge --merge-qkv -amb 512 -ctk q8_0 -ctv q8_0 --flash-attn on โŒ Garbage
4ร— RTX PRO 6000 Blackwell 30000 -ctk q8_0 -ctv q8_0 โŒ Garbage
4ร— RTX PRO 6000 Blackwell various --cpu-moe โŒ Garbage + slow
4ร— RTX PRO 6000 Blackwell various GGML_CUDA_FORCE_MMQ=1 โŒ Garbage + slow
4ร— RTX PRO 6000 Blackwell various partial offload -ngl 40 โŒ Garbage + slow
4ร— RTX PRO 6000 Blackwell various -ctk f16 -ctv f16 โŒ Garbage
6ร— RTX PRO 6000 Blackwell 4096 --no-mmap โŒ Garbage
6ร— RTX PRO 6000 Blackwell 4096 fused ops disabled via source patch โŒ Still garbage

Root Cause

  • ik_llama.cpp Issue #1045: Unresolved bug โ€” GLM-family models produce garbage on CUDA. Bisected to fused expert selection commits (28d3e63).
  • ik_llama.cpp Issue #1616: NaN logits on CUDA with GLM-5.1 when using Q4_0 KV cache.
  • GLM-DSA lacks GGML_PREC_F32 workarounds present in GLM4/GLM4_MOE.
  • Blackwell (SM120) architecture may exacerbate the issue.

Additional Findings

  • Q8_0 base model (747GB) crashes with segmentation fault on CPU-only inference (-ngl 0).
  • mainline llama.cpp cannot load this model โ€” ik_llama.cpp uses extended ggml type 141 (IQ3_KS) not supported upstream.
  • Performance is excellent (~51 tok/s prompt eval, ~40 tok/s generation) โ€” the model runs fast, just produces wrong tokens.

๐Ÿ”ง Call for Community Help

If you have successfully run GLM-5.1 (or GLM-4.5-Air) with IQ quantization on ik_llama.cpp CUDA, please share:

  1. Exact commit hash and build flags
  2. Launch command line
  3. GPU architecture (compute capability)
  4. Any source patches applied

Please comment on ik_llama.cpp Issue #1045 or open a PR to this repo.

Recommended Next Steps

  1. Wait for ik_llama.cpp fix โ€” Track Issue #1045.
  2. Try bati.cpp v0.2.0+ โ€” Official GLM-5.1 support is planned.
  3. Consider Q4_K_M or Q5_K quantization โ€” Non-IQ types may be less affected (unconfirmed).
  4. Consider alternative models โ€” DeepSeek V4, Qwen 3.6, etc.

Quantization Details

Item Value
Quantization type Dynamic IQ3 (IQ3_KS base + custom per-tensor)
Final size 347GB (18 shards)
Architecture glm-dsa
Average BPW 3.950
Toolchain bati.cpp (convert) โ†’ ik_llama.cpp (quantize)

Custom Quantization Strategy

High-importance tensors kept at high precision; routed experts heavily compressed:

Tensor Group Quantization
Attention (attn_k/v/q/output) q8_0
DSA indexer (proj/attn_k) q8_0
First 3 dense blocks iq6_k / iq5_ks
Shared experts iq6_k / iq5_ks
Early/Late routed experts (down) iq5_ks
Early/Late routed experts (gate/up) iq4_ks
Middle routed experts (down) iq4_ks
Middle routed experts (gate/up) iq3_ks
output / token_embd iq6_k / iq5_ks

Inference

โš ๏ธ The following commands are provided for reference, but output quality is currently broken on all tested configurations. See "Known Issue" above.

Requirements

  • RAM: 512GB+ recommended
  • GPU VRAM: 4-GPU config recommended (96GB+ each)
  • Supported engines: ik_llama.cpp or bati.cpp
  • Storage: 350GB+ free for GGUF files

Inference Commands

1. Fast Config (Recommended when fixed)

Full GPU offload. Optimized for speed, stable up to ctx=32768.

CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-server \
  -m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
  --host 127.0.0.1 --port 8081 \
  --ctx-size 32768 \
  -ngl 999 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
  -ctk q8_0 -ctv q8_0 \
  -t 24

2. Fallback Config โ€” Long Context

CPU offload mix. Achieves ctx=65535. Use when long context is mandatory.

CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-server \
  -m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
  --host 127.0.0.1 --port 8081 \
  --ctx-size 65535 \
  -ngl 70 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
  -ctk q4_0 -ctv q4_0 \
  -t 24

llama-cli (Chat Mode)

CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-cli \
  -m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
  --ctx-size 32768 \
  -ngl 999 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
  -ctk q8_0 -ctv q8_0 \
  -t 24 -cnv

Benchmark (4ร— RTX PRO 6000 Blackwell 96GB, verified 2026-05-17)

Config ctx KV cache GPU layers Gen speed Notes
Fast 32768 q8_0/q8_0 79/79 41.3 TPS Speed-optimized
Fallback 65535 q4_0/q4_0 70/79 21.8 TPS Long context
Reference 24576 q4_0/q4_0 79/79 ~37 TPS Historical
  • Fast config (-ngl 999): ctx=32768 is the limit. Beyond this, CUDA0 compute buffer OOM.
  • Fallback config (-ngl 70): Offloads last 9 layers to CPU to achieve ctx=65535. Speed drops ~47%.
  • Flash Attention is enabled by default in ik_llama.cpp (no explicit flag needed).

Build Report

See BUILD_REPORT.md for detailed build pipeline, environment, and the full list of tested configurations.


Credits

This model is based on:


SHA256

# Generate with:
sha256sum *.gguf > SHA256SUMS.txt

License

AGPL-3.0 (following the base model and abliteration artifacts)


Downloads last month
1,356
GGUF
Model size
754B params
Architecture
glm-dsa
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Quantized
(2)
this model