zai-org/GLM-5.2 optimized for running on a Mac Studio M3 512.

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 4-bit baseline with important layers at higher precision.
  • Fits into ~420 GB memory, leaving enough room for a smaller utility model.

Usage

NOTE: Run with https://github.com/ml-explore/mlx-lm/pull/1410 until the PR is merged.

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/GLM-5.2-MLX-4.5bit

Benchmarks

metric this model
bpw 4.535
base memory 392.454
peak memory (1024/512) 422.787
prompt tok/s (1024) 194.114 卤 0.079
gen tok/s (512) 17.781 卤 0.028
kl mean* 0.049 卤 0.002
kl p95 0.113 卤 0.002
perplexity 4.642 卤 0.036
arc_challenge 0.690 卤 0.021
hellaswag 0.780 卤 0.019

* KL calculated against the largest quant I could run locally (~5.3 bit). Real KL is against FP will be higher.

Methodology

Quantized with a mlx-lm fork. MLX quantization options differ than llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
Downloads last month
3,157
Safetensors
Model size
743B params
Tensor type
BF16
U32
F32
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for spicyneuron/GLM-5.2-MLX-4.5bit

Base model

zai-org/GLM-5.2
Quantized
(58)
this model