Getting error while using on CPU
/bin/python3 "/media/mohi/Disk 1/Solutyics/GLM_4_Testing/task1.py"
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.98it/s]
Traceback (most recent call last):
File "/media/mohi/Disk 1/Solutyics/GLM_4_Testing/task1.py", line 20, in <module>
output = model.generate(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1989, in generate
result = self._sample(
File "/home/mohi/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2932, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 801, in forward
transformer_outputs = self.transformer(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 707, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 551, in forward
layer_ret = layer(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 454, in forward
attention_output, kv_cache = self.self_attention(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 351, in forward
context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 211, in forward
context_layer = flash_attn_unpadded_func(
TypeError: 'NoneType' object is not callable
pip install flash-attn
Defaulting to user installation because normal site-packages is not writeable
Collecting flash-attn
Using cached flash_attn-2.6.3.tar.gz (2.6 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
fatal: not a git repository (or any of the parent directories): .git
/tmp/pip-install-dt3l45za/flash-attn_76bb505f607b4d9783ee43defc787cf6/setup.py:95: UserWarning: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
warnings.warn(
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-dt3l45za/flash-attn_76bb505f607b4d9783ee43defc787cf6/setup.py", line 179, in <module>
CUDAExtension(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1077, in CUDAExtension
library_dirs += library_paths(cuda=True)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1204, in library_paths
if (not os.path.exists(_join_cuda_home(lib_dir)) and
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2419, in _join_cuda_home
raise OSError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
torch.__version__ = 2.3.0+cu121
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
How can we fix this error?
Hi, sorry for the inconvenience. Our code currently relies on FlashAttention2, which only supports GPU environments. We will soon (hopefully by the end of this week) update the code to remove this dependency. Please stay tuned.
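For anyone who wants to confirm why `pip install flash-attn` fails on their machine, here is a minimal sketch that checks the two things the build log above complains about: `nvcc` on the PATH and the `CUDA_HOME` environment variable. The helper name `flash_attn_build_prereqs` is made up for this example, and the check only mirrors the messages in the log; it is not a complete test of whether the build would succeed.

```python
import os
import shutil


def flash_attn_build_prereqs(env=None):
    """Report the two prerequisites named in the flash-attn build log.

    `env` is a mapping of environment variables; it defaults to os.environ.
    This helper is hypothetical and only mirrors the log messages.
    """
    env = os.environ if env is None else env
    return {
        # The log warned: "nvcc was not found".
        "nvcc_on_path": shutil.which("nvcc", path=env.get("PATH")) is not None,
        # The log ended with: "CUDA_HOME environment variable is not set".
        "cuda_home_set": bool(env.get("CUDA_HOME")),
    }


if __name__ == "__main__":
    for name, ok in flash_attn_build_prereqs().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If both values come back as "no", the machine most likely has no CUDA toolkit installed, which matches the CPU-only setup described in this thread.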
Good news! We've updated modeling_chatglm.py to remove the dependency on FlashAttention2. I believe you can now run the model successfully on CPU.
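For readers wondering where the original `TypeError: 'NoneType' object is not callable` came from: one plausible explanation, not confirmed in this thread, is that the modeling file imported the FlashAttention function through a guarded import that leaves the name set to `None` when the package is missing. The sketch below illustrates that pattern together with a simple fallback dispatch. The helpers `optional_attr`, `pick_attention`, and `fallback_attention` are hypothetical, and this is not the actual change made to modeling_chatglm.py.

```python
import importlib
from typing import Callable, Optional


def optional_attr(module_name: str, attr: str) -> Optional[Callable]:
    """Return module_name.attr, or None if the module or attribute is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


def pick_attention(flash_fn: Optional[Callable], fallback_fn: Callable) -> Callable:
    """Prefer the optional flash kernel; otherwise use the fallback."""
    return flash_fn if flash_fn is not None else fallback_fn


def fallback_attention(*args, **kwargs):
    # Hypothetical stand-in for a non-flash attention path.
    return "fallback"


# The attribute name is taken from the traceback above and is an assumption;
# on a CPU-only machine without flash-attn this lookup yields None.
flash_fn = optional_attr("flash_attn", "flash_attn_unpadded_func")

# Calling flash_fn directly while it is None would reproduce the TypeError.
# Dispatching through pick_attention avoids that.
attention = pick_attention(flash_fn, fallback_attention)
```

The design point is that the check happens once, when the attention function is chosen, so a missing optional package never turns into a call on `None` deep inside the forward pass.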