Add SGLang serving instructions

#11
by MickJ - opened
Files changed (3)
  1. README.md +65 -45
  2. sound_tokenizer.ckpt +3 -0
  3. sound_tokenizer.json +42 -0
README.md CHANGED
@@ -10,54 +10,11 @@ tags:
  - cosmos3
  - vllm
  - vllm-omni
+ - sglang
+ - sglang-diffusion
  - diffusers
  - text, image, video, audio, and action generation
  - omnimodel
- countDownloads:
- - checkpoint.json
- - config.json
- - generation_config.json
- - model.safetensors.index.json
- - model_index.json
- - tokenizer.json
- - tokenizer_config.json
- - sound_tokenizer/config.json
- - sound_tokenizer/diffusion_pytorch_model.safetensors
- - text_tokenizer/tokenizer.json
- - text_tokenizer/tokenizer_config.json
- - transformer/config.json
- - transformer/diffusion_pytorch_model-00001-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00002-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00003-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00004-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00005-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00006-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00007-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00008-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00009-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00010-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00011-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00012-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00013-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00014-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00015-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00016-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00017-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00018-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00019-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00020-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00021-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00022-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00023-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00024-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00025-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00026-of-00027.safetensors
- - transformer/diffusion_pytorch_model-00027-of-00027.safetensors
- - transformer/diffusion_pytorch_model.safetensors.index.json
- - vae/config.json
- - vae/diffusion_pytorch_model.safetensors
- - vision_encoder/config.json
- - vision_encoder/model.safetensors
  ---
 
  # **Cosmos 3: Omnimodal World Models for Physical AI**
@@ -972,6 +929,69 @@ Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artif
 
  Cosmos3 outputs should not be treated as physically accurate simulation, reliable ground-truth reasoning, or safety-certified decision making. Applications involving robotics control, autonomous systems, scientific simulation, or safety-critical planning require additional validation, external constraints, system-level safety analysis, and domain-specific guardrails before deployment.
 
+ ### SGLang
+
+ SGLang-Diffusion can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video endpoints. Install SGLang from the main branch with diffusion dependencies, then start the server:
+
+ ```bash
+ git clone --branch main https://github.com/sgl-project/sglang.git
+ cd sglang
+ pip install -e "python[diffusion]"
+ pip install "cosmos-guardrail==0.3.1"
+
+ sglang serve \
+ --model-path nvidia/Cosmos3-Super \
+ --num-gpus 4
+ ```
+
+ Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
+
+ For the video-specialized checkpoint:
+
+ ```bash
+ sglang serve \
+ --model-path nvidia/Cosmos3-Super-Image2Video \
+ --num-gpus 4
+ ```
+
+ Supported SGLang endpoints:
+
+ | Mode | Endpoint | Notes |
+ | --- | --- | --- |
+ | Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
+ | Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
+ | Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
+
+ Example text-to-video request:
+
+ ```bash
+ job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
+ --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
+ --form-string "negative_prompt=blurry, distorted, low quality" \
+ --form-string "size=1280x720" \
+ --form-string "num_frames=81" \
+ --form-string "fps=24" \
+ --form-string "num_inference_steps=35" \
+ --form-string "guidance_scale=4.0" \
+ --form-string "flow_shift=10.0" \
+ --form-string "seed=42" \
+ --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
+ | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
+
+ while true; do
+ status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
+ | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
+ [ "$status" = "completed" ] && break
+ [ "$status" = "failed" ] && exit 1
+ sleep 1
+ done
+
+ curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
+ -o cosmos3_super_t2v_output.mp4
+ ```
+
+ Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
+
  ## Inference
 
  **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
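
The endpoint table added in this change lists image-to-video through `input_reference`, but the README only shows a text-to-video request. The sketch below shows what an image-to-video submission might look like. It assumes that `input_reference` is a multipart file part, which is how the table's word "upload" reads. The helper name `submit_i2v` and the `BASE_URL` variable are hypothetical; the host and port are copied from the text-to-video example.

```bash
# Sketch only: assumes input_reference is accepted as a multipart file upload.
# BASE_URL and submit_i2v are illustrative names, not part of SGLang.
BASE_URL="${BASE_URL:-http://localhost:30000}"

# submit_i2v IMAGE_PATH PROMPT
# Posts an image-to-video job and prints the server's raw JSON response.
submit_i2v() {
  # Fail early if the conditioning image is missing, before contacting the server.
  if [ ! -f "$1" ]; then
    echo "conditioning image not found: $1" >&2
    return 1
  fi
  # --form-string sends the prompt literally (no @ or < interpretation);
  # -F with a leading @ tells curl to upload the file contents.
  curl -sS -X POST "${BASE_URL}/v1/videos" \
    --form-string "prompt=$2" \
    -F "input_reference=@$1"
}
```

Calling `submit_i2v first_frame.png "The robot lifts the box."` should print a job description. If its shape matches the text-to-video case, the `id` field can be fed into the same polling loop and `/content` download shown in the README example.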
sound_tokenizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d
+ size 1985246007
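
The three added lines are a Git LFS pointer: the `oid` is the SHA-256 of the real checkpoint and `size` is its length in bytes. The sketch below shows one way to check a downloaded `sound_tokenizer.ckpt` against such a pointer. The function name `verify_lfs_object` is hypothetical, and the sketch assumes the GNU `sha256sum` tool is available (on macOS, `shasum -a 256` would be the usual substitute).

```bash
# Sketch only: verify_lfs_object is an illustrative helper, not a Git LFS command.
# verify_lfs_object POINTER_FILE OBJECT_FILE
# Returns 0 when OBJECT_FILE matches the oid and size recorded in POINTER_FILE.
verify_lfs_object() {
  # Pull the expected values out of the pointer's "oid sha256:<hex>" and "size <n>" lines.
  want_oid=$(sed -n 's/^oid sha256://p' "$1")
  want_size=$(sed -n 's/^size //p' "$1")
  # Measure the actual file.
  got_size=$(wc -c < "$2" | tr -d ' ')
  got_oid=$(sha256sum "$2" | cut -d' ' -f1)
  if [ "$got_size" != "$want_size" ]; then
    echo "size mismatch: expected $want_size, got $got_size" >&2
    return 1
  fi
  if [ "$got_oid" != "$want_oid" ]; then
    echo "sha256 mismatch: expected $want_oid, got $got_oid" >&2
    return 1
  fi
}
```

To use it, save the pointer lines (without the leading `+ ` diff markers) to a file and run `verify_lfs_object pointer.txt sound_tokenizer.ckpt`.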
sound_tokenizer.json ADDED
@@ -0,0 +1,42 @@
+ {
+ "model_type": "autoencoder_v2",
+ "sampling_rate": 48000,
+ "stereo": true,
+ "use_wav_as_input": true,
+ "normalize_volume": true,
+ "hop_size": 1920,
+ "input_channels": 1,
+ "enc_type": "spec_convnext",
+ "enc_dim": 192,
+ "enc_intermediate_dim": 768,
+ "enc_num_layers": 12,
+ "enc_num_blocks": 2,
+ "enc_n_fft": 64,
+ "enc_hop_length": 16,
+ "enc_latent_dim": 128,
+ "enc_c_mults": [1, 2, 4],
+ "enc_strides": [4, 5, 6],
+ "enc_identity_init": false,
+ "enc_use_snake": true,
+ "dec_type": "oobleck",
+ "dec_dim": 320,
+ "dec_c_mults": [1, 2, 4, 8, 16],
+ "dec_strides": [2, 4, 5, 6, 8],
+ "dec_use_snake": true,
+ "dec_final_tanh": false,
+ "dec_out_channels": 2,
+ "dec_anti_aliasing": false,
+ "dec_use_nearest_upsample": false,
+ "dec_use_tanh_at_final": false,
+ "bottleneck_type": "vae",
+ "bottleneck": {"type": "vae"},
+ "activation": "snakebeta",
+ "snake_logscale": true,
+ "anti_aliasing": false,
+ "use_cuda_kernel": false,
+ "causal": false,
+ "padding_mode": "zeros",
+ "vocoder_input_dim": 64,
+ "latent_mean": null,
+ "latent_std": null
+ }
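
The numbers in this config can be cross-checked with simple arithmetic. The sketch below assumes one plausible reading of the fields, which the PR itself does not state: that the encoder's total downsampling is `enc_hop_length` times the product of `enc_strides`, that the decoder's total upsampling is the product of `dec_strides`, and that `hop_size` is the number of audio samples per latent frame. The helper `product` is hypothetical.

```bash
# Sketch only: the interpretation of each field is an assumption, the values are from the config.
sampling_rate=48000
hop_size=1920
enc_hop_length=16

# product N1 N2 ... prints the product of its arguments (runs in a subshell).
product() (
  p=1
  for x in "$@"; do p=$((p * x)); done
  echo "$p"
)

enc_total=$(( enc_hop_length * $(product 4 5 6) ))   # enc_strides = [4, 5, 6]
dec_total=$(product 2 4 5 6 8)                       # dec_strides = [2, 4, 5, 6, 8]
frames_per_second=$(( sampling_rate / hop_size ))

echo "encoder downsampling: ${enc_total}"
echo "decoder upsampling:   ${dec_total}"
echo "latent frames/second: ${frames_per_second}"
```

Under that reading both factors come out to 1920, which equals `hop_size`, and 48000 / 1920 gives 25 latent frames per second of audio.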