Instructions to use mlx-community/GLM-4-32B-0414-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mlx-community/GLM-4-32B-0414-4bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4-32B-0414-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use mlx-community/GLM-4-32B-0414-4bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/GLM-4-32B-0414-4bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "mlx-community/GLM-4-32B-0414-4bit" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use mlx-community/GLM-4-32B-0414-4bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/GLM-4-32B-0414-4bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mlx-community/GLM-4-32B-0414-4bit
Run Hermes
hermes
- MLX LM
How to use mlx-community/GLM-4-32B-0414-4bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "mlx-community/GLM-4-32B-0414-4bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (it listens on port 8080 unless --port is given)
mlx_lm.server --model "mlx-community/GLM-4-32B-0414-4bit"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mlx-community/GLM-4-32B-0414-4bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
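The same chat request can also be sent from Python using only the standard library. This is a minimal sketch, assuming the server is reachable at http://localhost:8080 (change BASE_URL if yours listens elsewhere); build_chat_request and chat are illustrative helper names, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: adjust to the host/port your mlx_lm.server instance listens on.
BASE_URL = "http://localhost:8080/v1"
MODEL = "mlx-community/GLM-4-32B-0414-4bit"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Hello") should return the model's reply as a string. Building the request is kept separate from sending it so the payload can be inspected without a live server.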
2048 context length?
In LM Studio, the context window is set to a max of 2048.
Is that expected? It seems quite low compared to all the other recent models I've worked with.
Yeah, I know it's weird, but no: this model, like the others, has a context of 32768.
Edit: Seems this doesn't work anymore.
@Goekdeniz-Guelmez any idea how to override this max in LM Studio?
In the config.json file in the model folder, change the "max_position_embeddings" line to "max_position_embeddings": 32768,
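For anyone who would rather script that edit than do it by hand, here is a minimal sketch. The function name set_max_position_embeddings is made up for illustration, and the path to config.json is something you have to supply yourself, since where LM Studio keeps the model folder depends on your setup. It writes a .bak copy first so the change is easy to undo.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, value=32768):
    """Set max_position_embeddings in a config.json and return the previous value."""
    path = Path(config_path)
    original = path.read_text(encoding="utf-8")
    config = json.loads(original)
    old = config.get("max_position_embeddings")
    if old != value:
        # Keep a backup next to the original before overwriting it.
        path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
        config["max_position_embeddings"] = value
        path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old
```

Call it as set_max_position_embeddings("/path/to/model/folder/config.json"), where the path is a placeholder, then reload the model in LM Studio.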
tl;dr
The model is fine – LM Studio guesses 2048 for MLX builds. Set the Context Length manually (gear icon ▶ 32768)
Why “Max context 2048”?
- This comes from LM Studio’s MLX backend (the one that loads Apple-Silicon-friendly .npz/4-bit weights): in some versions of mlx-lm the indexer can’t read long-context hints from GLM configs, so it falls back to 2048 and shows that in the UI. A similar issue was reported for other MLX conversions (Gemma-3, etc.); see "Gemma 3 Context Window capped at 4096", Issue #48, ml-explore/mlx-lm on GitHub.
The good news is that this seems to be just cosmetic: override it at load time.
The quick fix with no file editing
- In My Models ▸ GLM-4-32B-0414-4bit click ⚙︎ Load settings.
- Change Context length from 2048 → 32768 (or whatever your memory allows). The 4-bit weights of a 32B model already take roughly 18 GB, and the KV cache grows with context length on top of that, so start lower if you’re on an M-series machine with <64 GB unified memory.
- Press Save as default → Load model. It might say “max 2048” in some places, but generation runs past that. I had it make a bunch of scripts for me and it didn't even get close to the context window filling up.
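To get a feel for how much memory a given context length adds, here is a rough back-of-envelope estimate of the KV-cache size. It assumes a 16-bit cache and takes the layer count (61), KV-head count (2) and head size (128) from the config.json posted in this thread; the helper name kv_cache_bytes is made up for illustration, and the real allocation in LM Studio or mlx-lm may differ.

```python
def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate bytes needed to cache keys and values for context_len tokens."""
    # Factor 2 covers keys and values; each layer stores one head_dim-sized
    # vector per KV head per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_len


GIB = 1024 ** 3
for ctx in (2048, 8192, 32768):
    size = kv_cache_bytes(ctx, num_layers=61, num_kv_heads=2, head_dim=128)
    print(f"{ctx:>6} tokens: {size / GIB:.2f} GiB")
# Prints 0.12, 0.48 and 1.91 GiB for the three context lengths.
```

Under these assumptions the cache stays under 2 GiB even at 32768 tokens, because the model has only two KV heads, so the weights are the larger part of the memory budget.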
If you're loading the model with the REST SDK, add the parameter there as well:
{
model: "mlx-community/GLM-4-32B-0414-4bit",
loadConfig: {
contextLength: 32768,
ropeFrequencyBase: 1_000_000, // optional but helps with >8 k
ropeFrequencyScale: 1.0
}
}
I had done some other stuff too, but that seems to be what fixed it for me. The config didn't have a bos_token, the eos_token had two values, and there was double quant stuff, so I changed those as well. I'm including my jacked-up config in case the above doesn't work by itself, but I believe the above is what actually got it working.
{
  "architectures": ["Glm4ForCausalLM"],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151329,
  "eos_token_id": 151336,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 6144,
  "initializer_range": 0.02,
  "intermediate_size": 23040,
  "max_position_embeddings": 32768,
  "model_type": "glm4",
  "num_attention_heads": 48,
  "num_hidden_layers": 61,
  "num_key_value_heads": 2,
  "pad_token_id": 151329,
  "partial_rotary_factor": 0.5,
  "quantization": {
    "group_size": 64,
    "bits": 4
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.0",
  "use_cache": true,
  "vocab_size": 151552,
  "additional_eos_token_ids": [151329, 151338]
}
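If you want to compare your own config.json against the things I changed, a small check like the one below can list them. This is only a sketch: config_warnings is a made-up helper, and reading "double quant stuff" as having both a "quantization" and a "quantization_config" key is my assumption. Nothing in this thread establishes that LM Studio actually needs any of these besides the context length.

```python
import json


def config_warnings(config):
    """Return notes about a config dict, based on the changes described in this thread."""
    notes = []
    if "bos_token_id" not in config:
        notes.append("no bos_token_id")
    if isinstance(config.get("eos_token_id"), list):
        notes.append("eos_token_id is a list rather than a single id")
    # Assumption: the duplicate quantization entry means both keys are present.
    if "quantization" in config and "quantization_config" in config:
        notes.append("both quantization and quantization_config are present")
    if config.get("max_position_embeddings", 0) < 32768:
        notes.append("max_position_embeddings is below 32768")
    return notes
```

Load your file with json.load and pass the resulting dict in; an empty list means none of these four things apply.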

