Instructions to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3",
	filename="GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00008-00001-of-00018.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
./llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Use Docker

docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

LM Studio
Jan
Ollama
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Ollama:
```
ollama run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
```

Unsloth Studio new

How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting

Pi new

How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Run Hermes

hermes

Docker Model Runner
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Docker Model Runner:
```
docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
```

Lemonade

How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Run and chat with the model

lemonade run user.GLM-5.1-Abliterated-Dynamic-IQ3-{{QUANT_TAG}}

List all available models

lemonade list

GLM-5.1-Abliterated-Dynamic-IQ3-340-GGUF

GGUF-quantized version of helixdouble/GLM-5.1-Abliterated (base: zai-org/GLM-5.1-FP8).

744B parameters MoE / 79 layers / 256 experts / 40B active / context length 202,752

⚠️ CRITICAL: Known Issue — Garbage Output on CUDA (As of 2026-05-18)

This model currently produces garbage output on all tested CUDA configurations. The build-time "smoke test" only measured token generation speed (tok/s) — output quality was never verified.

Symptom

Greedy decoding (temperature=0): always emits token 0 (!) repeatedly: !!!!!!!!!!!!!!!!!!!!
Sampling: random character garbage (e.g., H2%(@G&=6#9C4,')

Affected Configurations (all tested, all failed)

GPUs	ctx-size	Key flags	Result
6× RTX PRO 6000 Blackwell	4096	`-mla 3 -muge --merge-qkv -amb 512 -ctk q8_0 -ctv q8_0 --flash-attn on`	❌ Garbage
4× RTX PRO 6000 Blackwell	30000	`-ctk q8_0 -ctv q8_0`	❌ Garbage
4× RTX PRO 6000 Blackwell	various	`--cpu-moe`	❌ Garbage + slow
4× RTX PRO 6000 Blackwell	various	`GGML_CUDA_FORCE_MMQ=1`	❌ Garbage + slow
4× RTX PRO 6000 Blackwell	various	partial offload `-ngl 40`	❌ Garbage + slow
4× RTX PRO 6000 Blackwell	various	`-ctk f16 -ctv f16`	❌ Garbage
6× RTX PRO 6000 Blackwell	4096	`--no-mmap`	❌ Garbage
6× RTX PRO 6000 Blackwell	4096	fused ops disabled via source patch	❌ Still garbage

Root Cause

ik_llama.cpp Issue #1045: Unresolved bug — GLM-family models produce garbage on CUDA. Bisected to fused expert selection commits (28d3e63).
ik_llama.cpp Issue #1616: NaN logits on CUDA with GLM-5.1 when using Q4_0 KV cache.
GLM-DSA lacks GGML_PREC_F32 workarounds present in GLM4/GLM4_MOE.
Blackwell (SM120) architecture may exacerbate the issue.

Additional Findings

Q8_0 base model (747GB) crashes with segmentation fault on CPU-only inference (-ngl 0).
mainline llama.cpp cannot load this model — ik_llama.cpp uses extended ggml type 141 (IQ3_KS) not supported upstream.
Performance is excellent (~51 tok/s prompt eval, ~40 tok/s generation) — the model runs fast, just produces wrong tokens.

🔧 Call for Community Help

If you have successfully run GLM-5.1 (or GLM-4.5-Air) with IQ quantization on ik_llama.cpp CUDA, please share:

Exact commit hash and build flags
Launch command line
GPU architecture (compute capability)
Any source patches applied

Please comment on ik_llama.cpp Issue #1045 or open a PR to this repo.

Recommended Next Steps

Wait for ik_llama.cpp fix — Track Issue #1045.
Try bati.cpp v0.2.0+ — Official GLM-5.1 support is planned.
Consider Q4_K_M or Q5_K quantization — Non-IQ types may be less affected (unconfirmed).
Consider alternative models — DeepSeek V4, Qwen 3.6, etc.

Quantization Details

Item	Value
Quantization type	Dynamic IQ3 (IQ3_KS base + custom per-tensor)
Final size	347GB (18 shards)
Architecture	`glm-dsa`
Average BPW	3.950
Toolchain	bati.cpp (convert) → ik_llama.cpp (quantize)

Custom Quantization Strategy

High-importance tensors kept at high precision; routed experts heavily compressed:

Tensor Group	Quantization
Attention (attn_k/v/q/output)	q8_0
DSA indexer (proj/attn_k)	q8_0
First 3 dense blocks	iq6_k / iq5_ks
Shared experts	iq6_k / iq5_ks
Early/Late routed experts (down)	iq5_ks
Early/Late routed experts (gate/up)	iq4_ks
Middle routed experts (down)	iq4_ks
Middle routed experts (gate/up)	iq3_ks
output / token_embd	iq6_k / iq5_ks

Inference

⚠️ The following commands are provided for reference, but output quality is currently broken on all tested configurations. See "Known Issue" above.

Requirements

RAM: 512GB+ recommended
GPU VRAM: 4-GPU config recommended (96GB+ each)
Supported engines: ik_llama.cpp or bati.cpp
Storage: 350GB+ free for GGUF files

Inference Commands

1. Fast Config (Recommended when fixed)

Full GPU offload. Optimized for speed, stable up to ctx=32768.

CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-server \
  -m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
  --host 127.0.0.1 --port 8081 \
  --ctx-size 32768 \
  -ngl 999 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
  -ctk q8_0 -ctv q8_0 \
  -t 24

2. Fallback Config — Long Context

CPU offload mix. Achieves ctx=65535. Use when long context is mandatory.

CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-server \
  -m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
  --host 127.0.0.1 --port 8081 \
  --ctx-size 65535 \
  -ngl 70 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
  -ctk q4_0 -ctv q4_0 \
  -t 24

llama-cli (Chat Mode)

CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-cli \
  -m /path/to/GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00018.gguf \
  --ctx-size 32768 \
  -ngl 999 -sm layer --device CUDA0,CUDA1,CUDA2,CUDA3 \
  -ctk q8_0 -ctv q8_0 \
  -t 24 -cnv

Benchmark (4× RTX PRO 6000 Blackwell 96GB, verified 2026-05-17)

Config	ctx	KV cache	GPU layers	Gen speed	Notes
Fast	32768	q8_0/q8_0	79/79	41.3 TPS	Speed-optimized
Fallback	65535	q4_0/q4_0	70/79	21.8 TPS	Long context
Reference	24576	q4_0/q4_0	79/79	~37 TPS	Historical

Fast config (-ngl 999): ctx=32768 is the limit. Beyond this, CUDA0 compute buffer OOM.
Fallback config (-ngl 70): Offloads last 9 layers to CPU to achieve ctx=65535. Speed drops ~47%.
Flash Attention is enabled by default in ik_llama.cpp (no explicit flag needed).

Build Report

See BUILD_REPORT.md for detailed build pipeline, environment, and the full list of tested configurations.

Credits

This model is based on:

Base model: zai-org/GLM-5.1-FP8 — Z.ai's 754B MoE model
Abliteration: helixdouble/GLM-5.1-Abliterated — Precision abliteration via 250 Optuna trials + healing LoRA
Abliteration reproduction dataset: helixdouble/glm-5.1-abliteration-trials-250
GGUF conversion & quantization tools:
- bati.cpp — GLM-DSA compatible converter
- ik_llama.cpp — Dynamic IQ3 quantization engine
Calibration data: heliXdouble bartowski_calibration_v5 + Lna-Lab additions

SHA256

# Generate with:
sha256sum *.gguf > SHA256SUMS.txt

License

AGPL-3.0 (following the base model and abliteration artifacts)

Downloads last month: 1,356

GGUF

Model size

754B params

Architecture

glm-dsa

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3

Base model

zai-org/GLM-5.1-FP8

Quantized

helixdouble/GLM-5.1-Abliterated

Quantized

(2)

this model