🚨 2026-04-16 Emergency Patch Applied
A modified version of Z.ai's official `chat_template.jinja` has been applied to this repo (commit: 2026-04-16).
- Issue: When tool calling is used in vLLM/SGLang, the framework converts tool content into an array such as `[{"type": "text", "text": "..."}]`, but the original template handled only strings, so the tool output was rendered empty and the model entered an infinite loop, repeatedly calling the same tool.
- Impact scope: All GLM-5.1-based models deployed with vLLM or SGLang.
- Resolution: Replaced `chat_template.jinja` with the official patched version from `zai-org/GLM-5.1`. Weights (quantization) are unaffected; quality remains the same.
- Source: @Zai_org announcement
Local inference (mlx-lm, oMLX) users will have the patch applied automatically, so no action is required.
GLM-5.1-Alis-MLX-Dynamic-2.7bpw
Unsloth Dynamic 2.0-style per-tensor mixed-precision quantization of zai-org/GLM-5.1 for Apple Silicon via MLX.
| Metric | Value |
|---|---|
| Base model | zai-org/GLM-5.1 (754B MoE, 40B active) |
| Bits per weight | 2.681 |
| Peak memory | 249 GB |
| Generation speed | 18.35 tok/s (M3 Ultra 512GB) |
| Quantization time | 3.9 minutes |
| Format | MLX safetensors |
| License | MIT (same as base model) |
Why This Model
GLM-5.1 is the #1 open-source model on SWE-Bench Pro (58.4) and excels at long-horizon agentic coding tasks. However, at 1.5TB in BF16, it requires quantization to run locally.
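As a back-of-envelope check on these sizes, the sketch below converts a parameter count and a bits-per-weight figure into bytes. It assumes the rounded 754B parameter count is exact and ignores runtime overhead such as the KV cache, so the quantized figure estimates weight storage only, not measured peak memory. The helper name `weight_bytes` is illustrative.

```python
# Rough weight-storage estimate: parameters x bits per weight / 8 bytes.
# Assumption: 754e9 parameters (the rounded figure quoted for GLM-5.1).

def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate storage needed for the weights, in bytes."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 754e9

bf16_tb = weight_bytes(NUM_PARAMS, 16) / 1e12        # decimal terabytes
quant_gb = weight_bytes(NUM_PARAMS, 2.681) / 1e9     # decimal gigabytes
quant_gib = weight_bytes(NUM_PARAMS, 2.681) / 2**30  # binary gibibytes

print(f"BF16:      {bf16_tb:.2f} TB")
print(f"2.681 bpw: {quant_gb:.0f} GB ({quant_gib:.0f} GiB)")
```

The BF16 estimate comes out at about 1.5 TB, in line with the figure quoted above.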
This quantization applies Unsloth Dynamic 2.0-style per-tensor bit allocation natively in MLX:
- Critical layers (embeddings, attention, routers) get higher precision (5-8 bit)
- MoE expert weights (93%+ of parameters) get aggressive compression (2-3 bit)
- Result: near-Unsloth quality at roughly 1.5× the generation speed of the GGUF build on Apple Silicon (18.35 vs ~12 tok/s on M3 Ultra)
Comparison
| | Unsloth GGUF IQ2_M | This model (MLX) | spicyneuron 2.9bit | MLX uniform 4bit |
|---|---|---|---|---|
| BPW | ~2.7 | 2.681 | 2.906 | ~4.5 |
| Speed (M3 Ultra) | ~12 tok/s | 18.35 tok/s | 19.5 tok/s | 15.4 tok/s |
| Memory | ~255 GB | 249 GB | 272 GB | ~420 GB |
| Format | GGUF (llama.cpp) | MLX native | MLX native | MLX native |
Benchmarks
All benchmarks run on Mac Studio M3 Ultra 512GB under identical conditions to enable fair comparison with spicyneuron/GLM-5.1-MLX-2.9bit.
Methodology
mlx_lm.perplexity --sequence-length 2048 --seed 123 --num-samples 50
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000
mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 2000
mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 2000
Results
| Metric | This model (2.681 bpw) | spicyneuron (2.906 bpw) |
|---|---|---|
| Perplexity ↓ | 4.165 ± 0.036 | 4.118 ± 0.016 |
| HellaSwag (acc_norm) ↑ | 0.606 ± 0.011 | 0.590 ± 0.011 |
| PIQA (acc) ↑ | 0.796 ± 0.009 | 0.794 ± 0.009 |
| WinoGrande (acc) ↑ | 0.660 ± 0.013 | 0.695 ± 0.013 |
| Prompt tok/s ↑ | 193.7 | 194.2 |
| Generation tok/s ↑ | 18.35 | 19.53 |
| Peak memory (GB) ↓ | 251 | 272 |
Summary: At about 8% lower BPW, this model scores higher on HellaSwag (+0.016) and matches PIQA within error (+0.002), with slightly higher perplexity (Δ0.047) and 21 GB less peak memory. WinoGrande (-0.035) is the only notable trade-off. The aggressive 2-bit MoE expert compression largely preserves reasoning quality while maximizing memory efficiency for 512 GB systems.
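The differences quoted in the summary can be re-derived from the results table. The sketch below is one way to do that arithmetic; it assumes the ± values are standard errors and that the two runs are independent, neither of which the table states explicitly. The `compare` helper is illustrative.

```python
import math

# Values copied from the results table.
# Each entry: (this model, its error, spicyneuron, its error).
RESULTS = {
    "perplexity": (4.165, 0.036, 4.118, 0.016),
    "hellaswag": (0.606, 0.011, 0.590, 0.011),
    "piqa": (0.796, 0.009, 0.794, 0.009),
    "winogrande": (0.660, 0.013, 0.695, 0.013),
}


def compare(ours, ours_err, theirs, theirs_err):
    """Return the difference and its size in combined standard errors."""
    delta = ours - theirs
    combined = math.sqrt(ours_err**2 + theirs_err**2)
    return delta, delta / combined


# Relative saving in bits per weight (2.681 vs 2.906).
bpw_reduction = (2.906 - 2.681) / 2.906
print(f"BPW reduction: {bpw_reduction:.1%}")

for name, row in RESULTS.items():
    delta, sigmas = compare(*row)
    print(f"{name:<11} delta={delta:+.3f} ({sigmas:+.1f} combined std errors)")
```

Under those assumptions, every difference is within about two combined standard errors, so the benchmark gaps are best read as small rather than decisive.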
Quantization Recipe
Per-tensor mixed precision using `mlx_lm.convert()` with a custom `quant_predicate`:
| Tensor Category | Normal Layers | Sensitive Layers (0-4, 73-77) |
|---|---|---|
| embed_tokens | 8-bit | 8-bit |
| lm_head | 6-bit | 6-bit |
| attention q/k/v | 5-bit | 6-bit |
| attention o_proj | 6-bit | 8-bit |
| DSA indexer | 6-bit | 6-bit |
| MoE router (gate) | 8-bit | 8-bit |
| MoE routed experts | 2-bit | 3-bit |
| Shared experts | 3-bit | 4-bit |
| Dense MLP (L0-2) | 4-bit | 5-bit |
Key design decisions:
- No `isinstance()` filter: MoE experts are `DeepseekV32MoE` modules, not `nn.Linear`
- First/last 5 layers boosted: roughly 10× more sensitive to quantization (consistent with oQ, JANG, Moonglade findings)
- MoE router always 8-bit: routing accuracy directly determines overall model quality
- `o_proj` higher precision: no preceding norm layer, so AWQ correction is impossible
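The "first/last 5 layers" rule depends on reading a layer index out of each parameter path. A minimal sketch of that step follows, assuming paths contain a `layers.<index>` segment and that the model has 78 layers; the helper names and the sample paths are hypothetical and only for illustration.

```python
import re

NUM_LAYERS = 78   # assumed total number of transformer layers
EDGE_LAYERS = 5   # layers at each end treated as sensitive

# Matches the numeric index in a path segment such as ".layers.12."
_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)(?:\.|$)")


def layer_index(path: str):
    """Return the layer number embedded in a parameter path, or None."""
    match = _LAYER_RE.search(path)
    return int(match.group(1)) if match else None


def is_sensitive(path: str) -> bool:
    """True for layers 0-4 and 73-77; False for other or non-layer tensors."""
    idx = layer_index(path)
    if idx is None:
        return False
    return idx < EDGE_LAYERS or idx >= NUM_LAYERS - EDGE_LAYERS


# Hypothetical paths, for illustration only.
for path in (
    "model.layers.0.self_attn.o_proj",
    "model.layers.40.self_attn.o_proj",
    "model.layers.77.mlp.shared_experts.down_proj",
    "model.embed_tokens",
):
    print(path, layer_index(path), is_sensitive(path))
```

Tensors outside any layer (embeddings, output head) return `None` for the index and are never treated as sensitive here; their bit widths are fixed by name instead.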
Usage
mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw")
messages = [{"role": "user", "content": "Write a Python snake game using pygame"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2000, verbose=True)
mlx-lm server (OpenAI-compatible API)
python3 -m mlx_lm.server --model avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw --port 1237
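Once the server is running, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library and assumes the server is reachable on `localhost` at port 1237 (the port used in the command above) and returns responses in the usual chat-completions shape. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1237/v1"  # assumes the --port 1237 server above
MODEL_ID = "avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # Assumed response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Hello"))
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.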
oMLX
Place the model in your oMLX models directory and refresh. Recommended settings:
- Reasoning Parser: `deepseek_r1`
- Temperature: 0.7, Top P: 0.95
- TurboQuant KV Cache: ON (4-bit)
- Index Cache: ON (DSA optimization)
- CTX Window: 131072
Hardware Requirements
| Config | Works? | Notes |
|---|---|---|
| M3 Ultra 512GB | ✅ | 249GB used, ~263GB headroom |
| M3 Ultra 192GB | ❌ | Not enough memory |
| M4 Max 128GB | ❌ | Not enough memory |
Quantization Reproduction
from collections import Counter

from mlx_lm import convert

NUM_LAYERS = 78
# The first and last five layers (0-4 and 73-77) get higher precision.
SENSITIVE = set(range(5)) | set(range(NUM_LAYERS - 5, NUM_LAYERS))
stats = Counter()  # number of tensors assigned to each bit width


def predicate(path, module):
    # Normalization layers stay unquantized.
    if "norm" in path.lower():
        return False

    # Extract the layer index from paths containing "layers.<n>".
    layer_num = None
    parts = path.split(".")
    for j, p in enumerate(parts):
        if p == "layers" and j + 1 < len(parts):
            try:
                layer_num = int(parts[j + 1])
            except ValueError:
                pass
    sensitive = layer_num is not None and layer_num in SENSITIVE

    if "embed_tokens" in path:
        bits = 8
    elif "lm_head" in path:
        bits = 6
    elif "q_a_proj" in path or "q_b_proj" in path:
        bits = 6 if sensitive else 5
    elif "kv_a_proj_with_mqa" in path:
        bits = 6 if sensitive else 5
    elif "o_proj" in path:
        bits = 8 if sensitive else 6
    elif "indexer" in path:
        bits = 6
    elif (
        "mlp" in path
        and "shared_experts" not in path
        and layer_num is not None
        and layer_num >= 3
    ):
        # MoE layers (3 and above), excluding shared experts.
        bits = 3 if sensitive else 2
    elif "shared_experts" in path:
        bits = 4 if sensitive else 3
    elif "mlp" in path and layer_num is not None and layer_num < 3:
        # Dense MLP in layers 0-2.
        bits = 5 if sensitive else 4
    else:
        bits = 4

    stats[bits] += 1
    return {"bits": bits, "group_size": 64}


convert(
    hf_path="zai-org/GLM-5.1",
    mlx_path="./GLM-5.1-Alis-MLX-Dynamic-2.7bpw",
    quantize=True, q_group_size=64, q_bits=4,
    quant_predicate=predicate,
)
print(dict(stats))  # tensor count per bit width
Credits
- Z.ai / Zhipu AI: GLM-5.1 base model (MIT license)
- Unsloth: Dynamic 2.0 per-tensor quantization methodology
- Apple MLX: Framework and `quant_predicate` API
- Moonglade/Brooooooklyn: Unsloth→MLX porting precedent
- oMLX/oQ: MoE-aware quantization insights
- JANG: MLX mixed-precision quantization reference
Citation
@misc{glm5.1-alis-mlx,
  title={GLM-5.1-Alis-MLX-Dynamic-2.7bpw: Unsloth-style per-tensor quantization for Apple Silicon},
  author={Alis (avlp12)},
  year={2026},
  url={https://huggingface.co/avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw}
}