Instructions to use mlx-community/GLM-4-32B-0414-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mlx-community/GLM-4-32B-0414-4bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4-32B-0414-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use mlx-community/GLM-4-32B-0414-4bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/GLM-4-32B-0414-4bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "mlx-community/GLM-4-32B-0414-4bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use mlx-community/GLM-4-32B-0414-4bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/GLM-4-32B-0414-4bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mlx-community/GLM-4-32B-0414-4bit

Run Hermes

hermes

MLX LM

How to use mlx-community/GLM-4-32B-0414-4bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "mlx-community/GLM-4-32B-0414-4bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "mlx-community/GLM-4-32B-0414-4bit"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "mlx-community/GLM-4-32B-0414-4bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
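
For scripting against the server, the same request can be sent from Python. This is only a sketch using the standard library: it assumes the server is reachable on port 8080 (the port used in the Pi and Hermes configs above), and the helper names build_chat_request and chat are made up for illustration.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="mlx-community/GLM-4-32B-0414-4bit",
                       base_url="http://localhost:8080/v1"):
    """Build a POST request for the OpenAI-compatible chat completions route.

    base_url is an assumption; change it if the server runs elsewhere.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with the server already running:
#   print(chat("Hello"))
```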

2048 context length?

by huyzed - opened Apr 26, 2025

Discussion

huyzed

MLX Community org Apr 26, 2025

In LM Studio, the context window is set to a max of 2048.

Is that expected? It seems quite low compared to all the other recent models I've worked with.

Goekdeniz-Guelmez

MLX Community org Apr 27, 2025

Yeah, I know it's weird, but no: this model, along with the others, has a context of 32768.

huyzed

MLX Community org Apr 29, 2025

@Goekdeniz-Guelmez any idea how to override this max in LM Studio?

deleted

Apr 30, 2025 (edited Apr 30, 2025)

Edit: Seems this doesn't work anymore.

Quoting huyzed: "@Goekdeniz-Guelmez any idea how to override this max in LM Studio?"

In the config.json file in the model's folder, change the "max_position_embeddings" line to "max_position_embeddings": 32768,
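
For anyone who would rather script that edit than do it by hand, here is a minimal sketch. The function name set_max_position_embeddings is made up, and the path in the usage comment is a placeholder for wherever the model folder lives on your machine. Per the edit note above, this change alone may no longer affect what LM Studio reports.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, value=32768):
    """Rewrite max_position_embeddings in a model's config.json.

    Returns the previous value (None if the key was missing) so the
    change can be reverted by hand.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Usage (placeholder path, adjust to your own model folder):
#   set_max_position_embeddings("path/to/GLM-4-32B-0414-4bit/config.json")
```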

R-I-n-g-o

May 1, 2025

tl;dr
The model is fine – LM Studio guesses 2048 for MLX builds. Set the Context Length manually (gear icon ▶ 32768)

Why “Max context 2048”?

In some versions of mlx-lm, the indexer in LM Studio’s MLX backend (the one that loads Apple-Silicon-friendly .npz/4-bit weights) can’t read long-context hints from GLM configs, so it falls back to 2048 and shows that in the UI. The same issue was reported for other MLX conversions such as Gemma 3 (see "Gemma 3 Context Window capped at 4096 · Issue #48 · ml-explore/mlx-lm · GitHub").

The good news is that it seems to be just cosmetic: you can override it at load time.

The quick fix with no file editing

In My Models ▸ GLM-4-32B-0414-4bit click ⚙︎ Load settings.
Change Context length from 2048 → 32768 (or whatever your memory allows). The 4-bit weights of a 32 B model already take roughly 18 GB, and the KV-cache grows with context length on top of that, so start lower if you’re on an M-series machine with <64 GB unified memory.
Press Save as default → Load model. It might still say “max 2048” in some places, but generation runs past that. I had it make a bunch of scripts for me and it didn't even get close to filling up the context window.
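
To get a feel for how much memory the cache itself needs at a given context length, here is a rough back-of-envelope sketch. It assumes an unquantized 16-bit cache (2 bytes per value) and uses the layer count, KV-head count, and head_dim from the config posted further down. The function name kv_cache_bytes is made up, and real usage in LM Studio may differ.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   bytes_per_value=2):
    """Estimate KV-cache size: one key and one value tensor per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_length


# Values from the config below: 61 layers, 2 KV heads, head_dim 128.
size = kv_cache_bytes(num_layers=61, num_kv_heads=2, head_dim=128,
                      context_length=32768)
print(f"{size / 2**30:.2f} GiB")  # prints "1.91 GiB"
```

Under these assumptions the cache at 32768 tokens comes to a bit under 2 GiB. The model weights are a separate and much larger cost.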

If you load the model through the REST SDK, add the same parameter there as well:

{
  model: "mlx-community/GLM-4-32B-0414-4bit",
  loadConfig: {
    contextLength: 32768,
    ropeFrequencyBase: 1_000_000,   // optional but helps with >8 k
    ropeFrequencyScale: 1.0
  }
}

I had done some other stuff too, but that seems to be what fixed it for me. The config didn't have a bos_token and the eos_token was two values; there was also some doubled-up quantization stuff, so I changed that as well. I'm including my jacked-up config below in case the above doesn't work by itself, but I believe the above is what actually got it to work.

{
  "architectures": ["Glm4ForCausalLM"],
  "attention_bias": false,
  "attention_dropout": 0.0,

  "bos_token_id": 151329,
  "eos_token_id": 151336,

  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 6144,
  "initializer_range": 0.02,
  "intermediate_size": 23040,
  "max_position_embeddings": 32768,
  "model_type": "glm4",
  "num_attention_heads": 48,
  "num_hidden_layers": 61,
  "num_key_value_heads": 2,
  "pad_token_id": 151329,
  "partial_rotary_factor": 0.5,

  "quantization": {
    "group_size": 64,
    "bits": 4
  },

  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.0",
  "use_cache": true,
  "vocab_size": 151552,

  "additional_eos_token_ids": [151329, 151338]
}

R-I-n-g-o

May 1, 2025
