Instructions to use zai-org/LongWriter-glm4-9b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/LongWriter-glm4-9b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/LongWriter-glm4-9b", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("zai-org/LongWriter-glm4-9b", trust_remote_code=True, dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use zai-org/LongWriter-glm4-9b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/LongWriter-glm4-9b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/LongWriter-glm4-9b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/zai-org/LongWriter-glm4-9b

SGLang

How to use zai-org/LongWriter-glm4-9b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/LongWriter-glm4-9b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/LongWriter-glm4-9b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/LongWriter-glm4-9b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/LongWriter-glm4-9b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/LongWriter-glm4-9b with Docker Model Runner:
```
docker model run hf.co/zai-org/LongWriter-glm4-9b
```

Getting error while using on CPU

by mohi0170 - opened Aug 16, 2024

Discussion

mohi0170

Aug 16, 2024

/bin/python3 "/media/mohi/Disk 1/Solutyics/GLM_4_Testing/task1.py"
╭─    /media/mohi/Disk 1/Solutyics/GLM_4_Testing ···························································································· ✔ ─╮
╰─ /bin/python3 "/media/mohi/Disk 1/Solutyics/GLM_4_Testing/task1.py" ─╯
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.98it/s]
Traceback (most recent call last):
File "/media/mohi/Disk 1/Solutyics/GLM_4_Testing/task1.py", line 20, in
output = model.generate(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1989, in generate
result = self._sample(
File "/home/mohi/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2932, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 801, in forward
transformer_outputs = self.transformer(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 707, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 551, in forward
layer_ret = layer(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 454, in forward
attention_output, kv_cache = self.self_attention(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 351, in forward
context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mohi/.cache/huggingface/modules/transformers_modules/THUDM/LongWriter-glm4-9b/81b025e373d163efd7908a787b3fb907424c6184/modeling_chatglm.py", line 211, in forward
context_layer = flash_attn_unpadded_func(
TypeError: 'NoneType' object is not callable

╭─    /media/mohi/Disk 1/Solutyics/GLM_4_Testing ············································································ 1 ✘  took 11s  ─╮
╰─ pip install flash-attn ─╯

Defaulting to user installation because normal site-packages is not writeable
Collecting flash-attn
Using cached flash_attn-2.6.3.tar.gz (2.6 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
fatal: not a git repository (or any of the parent directories): .git
/tmp/pip-install-dt3l45za/flash-attn_76bb505f607b4d9783ee43defc787cf6/setup.py:95: UserWarning: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
warnings.warn(
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-dt3l45za/flash-attn_76bb505f607b4d9783ee43defc787cf6/setup.py", line 179, in
CUDAExtension(
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1077, in CUDAExtension
library_dirs += library_paths(cuda=True)
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1204, in library_paths
if (not os.path.exists(_join_cuda_home(lib_dir)) and
File "/home/mohi/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2419, in _join_cuda_home
raise OSError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

  torch.__version__  = 2.3.0+cu121
  
  
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

mohi0170

Aug 16, 2024

How we can fix that error

bys0318

Z.ai org Aug 16, 2024

•

edited Aug 16, 2024

Hi, sorry for bringing this inconvenience. Currently our code relies on FlashAttention2 while FlashAttention2 only supports GPU environment. We will soon (hopefully by the end of this week) update the code to remove the reliance on FlashAttention2. Please stay tuned.

bys0318

Z.ai org Aug 17, 2024

•

edited Aug 17, 2024

Good news! We've updated the modeling_chatglm.py to get rid of the dependency on FlashAttention2. Now I believe you can successfully run it on CPU.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment