Instructions to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3", filename="GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00008-00001-of-00018.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: ./llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: ./build/bin/llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Use Docker
docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
- LM Studio
- Jan
- Ollama
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Ollama:
ollama run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
- Unsloth Studio new
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting
- Pi new
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Run Hermes
hermes
- Docker Model Runner
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Docker Model Runner:
docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
- Lemonade
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Run and chat with the model
lemonade run user.GLM-5.1-Abliterated-Dynamic-IQ3-{{QUANT_TAG}}List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3# Run inference directly in the terminal:
llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3# Run inference directly in the terminal:
./llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3# Run inference directly in the terminal:
./build/bin/llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3Use Docker
docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3GLM-5.1-Abliterated-Dynamic-IQ3-340-GGUF
GGUF-quantized version of helixdouble/GLM-5.1-Abliterated (base: zai-org/GLM-5.1-FP8).
744B parameters MoE / 79 layers / 256 experts / 40B active / context length 202,752
โ ๏ธ CRITICAL: Known Issue โ Garbage Output on CUDA (As of 2026-05-18)
This model currently produces garbage output on all tested CUDA configurations. The build-time "smoke test" only measured token generation speed (tok/s) โ output quality was never verified.
Symptom
- Greedy decoding (
temperature=0): always emits token 0 (!) repeatedly:!!!!!!!!!!!!!!!!!!!! - Sampling: random character garbage (e.g.,
H2%(@G&=6#9C4,')
Affected Configurations (all tested, all failed)
| GPUs | ctx-size | Key flags | Result |
|---|---|---|---|
| 6ร RTX PRO 6000 Blackwell | 4096 | -mla 3 -muge --merge-qkv -amb 512 -ctk q8_0 -ctv q8_0 --flash-attn on |
โ Garbage |
| 4ร RTX PRO 6000 Blackwell | 30000 | -ctk q8_0 -ctv q8_0 |
โ Garbage |
| 4ร RTX PRO 6000 Blackwell | various | --cpu-moe |
โ Garbage + slow |
| 4ร RTX PRO 6000 Blackwell | various | GGML_CUDA_FORCE_MMQ=1 |
โ Garbage + slow |
| 4ร RTX PRO 6000 Blackwell | various | partial offload -ngl 40 |
โ Garbage + slow |
| 4ร RTX PRO 6000 Blackwell | various | -ctk f16 -ctv f16 |
โ Garbage |
| 6ร RTX PRO 6000 Blackwell | 4096 | --no-mmap |
โ Garbage |
| 6ร RTX PRO 6000 Blackwell | 4096 | fused ops disabled via source patch | โ Still garbage |
Root Cause
- ik_llama.cpp Issue #1045: Unresolved bug โ GLM-family models produce garbage on CUDA. Bisected to fused expert selection commits (
28d3e63). - ik_llama.cpp Issue #1616: NaN logits on CUDA with GLM-5.1 when using Q4_0 KV cache.
- GLM-DSA lacks
GGML_PREC_F32workarounds present in GLM4/GLM4_MOE. - Blackwell (SM120) architecture may exacerbate the issue.
Additional Findings
- Q8_0 base model (747GB) crashes with segmentation fault on CPU-only inference (
-ngl 0). - mainline llama.cpp cannot load this model โ ik_llama.cpp uses extended ggml type 141 (IQ3_KS) not supported upstream.
- Performance is excellent (~51 tok/s prompt eval, ~40 tok/s generation) โ the model runs fast, just produces wrong tokens.
๐ง Call for Community Help
If you have successfully run GLM-5.1 (or GLM-4.5-Air) with IQ quantization on ik_llama.cpp CUDA, please share:
- Exact commit hash and build flags
- Launch command line
- GPU architecture (compute capability)
- Any source patches applied
Please comment on ik_llama.cpp Issue #1045 or open a PR to this repo.
Recommended Next Steps
- Wait for ik_llama.cpp fix โ Track Issue #1045.
- Try bati.cpp v0.2.0+ โ Official GLM-5.1 support is planned.
- Consider Q4_K_M or Q5_K quantization โ Non-IQ types may be less affected (unconfirmed).
- Consider alternative models โ DeepSeek V4, Qwen 3.6, etc.
Quantization Details
| Item | Value |
|---|---|
| Quantization type | Dynamic IQ3 (IQ3_KS base + custom per-tensor) |
| Final size | 347GB (18 shards) |
| Architecture | glm-dsa |
| Average BPW | 3.950 |
| Toolchain | bati.cpp (convert) โ ik_llama.cpp (quantize) |
Custom Quantization Strategy
High-importance tensors kept at high precision; routed experts heavily compressed:
| Tensor Group | Quantization |
|---|---|
| Attention (attn_k/v/q/output) | q8_0 |
| DSA indexer (proj/attn_k) | q8_0 |
| First 3 dense blocks | iq6_k / iq5_ks |
| Shared experts | iq6_k / iq5_ks |
| Early/Late routed experts (down) | iq5_ks |
| Early/Late routed experts (gate/up) | iq4_ks |
| Middle routed experts (down) | iq4_ks |
| Middle routed experts (gate/up) | iq3_ks |
| output / token_embd | iq6_k / iq5_ks |
Inference
โ ๏ธ The following commands are provided for reference, but output quality is currently broken on all tested configurations. See "Known Issue" above.
Requirements
- RAM: 512GB+ recommended
- GPU VRAM: 4-GPU config recommended (96GB+ each)
- Supported engines: ik_llama.cpp or bati.cpp
- Storage: 350GB+ free for GGUF files
Inference Commands
1. Fast Config (Recommended when fixed)
Full GPU offload. Optimized for speed, stable up to ctx=32768.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-server \
-m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
--host 127.0.0.1 --port 8081 \
--ctx-size 32768 \
-ngl 999 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
-ctk q8_0 -ctv q8_0 \
-t 24
2. Fallback Config โ Long Context
CPU offload mix. Achieves ctx=65535. Use when long context is mandatory.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-server \
-m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
--host 127.0.0.1 --port 8081 \
--ctx-size 65535 \
-ngl 70 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
-ctk q4_0 -ctv q4_0 \
-t 24
llama-cli (Chat Mode)
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-cli \
-m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
--ctx-size 32768 \
-ngl 999 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
-ctk q8_0 -ctv q8_0 \
-t 24 -cnv
Benchmark (4ร RTX PRO 6000 Blackwell 96GB, verified 2026-05-17)
| Config | ctx | KV cache | GPU layers | Gen speed | Notes |
|---|---|---|---|---|---|
| Fast | 32768 | q8_0/q8_0 | 79/79 | 41.3 TPS | Speed-optimized |
| Fallback | 65535 | q4_0/q4_0 | 70/79 | 21.8 TPS | Long context |
| Reference | 24576 | q4_0/q4_0 | 79/79 | ~37 TPS | Historical |
- Fast config (
-ngl 999): ctx=32768 is the limit. Beyond this, CUDA0 compute buffer OOM. - Fallback config (
-ngl 70): Offloads last 9 layers to CPU to achieve ctx=65535. Speed drops ~47%. - Flash Attention is enabled by default in ik_llama.cpp (no explicit flag needed).
Build Report
See BUILD_REPORT.md for detailed build pipeline, environment, and the full list of tested configurations.
Credits
This model is based on:
- Base model: zai-org/GLM-5.1-FP8 โ Z.ai's 754B MoE model
- Abliteration: helixdouble/GLM-5.1-Abliterated โ Precision abliteration via 250 Optuna trials + healing LoRA
- Abliteration reproduction dataset: helixdouble/glm-5.1-abliteration-trials-250
- GGUF conversion & quantization tools:
- bati.cpp โ GLM-DSA compatible converter
- ik_llama.cpp โ Dynamic IQ3 quantization engine
- Calibration data: heliXdouble bartowski_calibration_v5 + Lna-Lab additions
SHA256
# Generate with:
sha256sum *.gguf > SHA256SUMS.txt
License
AGPL-3.0 (following the base model and abliteration artifacts)
- Downloads last month
- 1,356
We're not able to determine the quantization variants.
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3# Run inference directly in the terminal: llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3