nvidia/Cosmos3-Super

@@ -10,54 +10,11 @@ tags:
   - cosmos3
   - vllm
   - vllm-omni
   - diffusers
   - text, image, video, audio, and action generation
   - omnimodel
-countDownloads:
-  - checkpoint.json
-  - config.json
-  - generation_config.json
-  - model.safetensors.index.json
-  - model_index.json
-  - tokenizer.json
-  - tokenizer_config.json
-  - sound_tokenizer/config.json
-  - sound_tokenizer/diffusion_pytorch_model.safetensors
-  - text_tokenizer/tokenizer.json
-  - text_tokenizer/tokenizer_config.json
-  - transformer/config.json
-  - transformer/diffusion_pytorch_model-00001-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00002-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00003-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00004-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00005-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00006-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00007-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00008-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00009-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00010-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00011-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00012-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00013-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00014-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00015-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00016-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00017-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00018-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00019-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00020-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00021-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00022-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00023-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00024-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00025-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00026-of-00027.safetensors
-  - transformer/diffusion_pytorch_model-00027-of-00027.safetensors
-  - transformer/diffusion_pytorch_model.safetensors.index.json
-  - vae/config.json
-  - vae/diffusion_pytorch_model.safetensors
-  - vision_encoder/config.json
-  - vision_encoder/model.safetensors
 ---
 # **Cosmos 3: Omnimodal World Models for Physical AI**
@@ -972,6 +929,69 @@ Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artif
 Cosmos3 outputs should not be treated as physically accurate simulation, reliable ground-truth reasoning, or safety-certified decision making. Applications involving robotics control, autonomous systems, scientific simulation, or safety-critical planning require additional validation, external constraints, system-level safety analysis, and domain-specific guardrails before deployment.
 ## Inference
 **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)

   - cosmos3
   - vllm
   - vllm-omni
+  - sglang
+  - sglang-diffusion
   - diffusers
   - text, image, video, audio, and action generation
   - omnimodel
 ---
 # **Cosmos 3: Omnimodal World Models for Physical AI**
 Cosmos3 outputs should not be treated as physically accurate simulation, reliable ground-truth reasoning, or safety-certified decision making. Applications involving robotics control, autonomous systems, scientific simulation, or safety-critical planning require additional validation, external constraints, system-level safety analysis, and domain-specific guardrails before deployment.
+### SGLang
+SGLang Diffusion can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video endpoints. Install SGLang from the main branch with the diffusion dependencies, then start the server:
+```bash
+git clone --branch main https://github.com/sgl-project/sglang.git
+cd sglang
+pip install -e "python[diffusion]"
+pip install "cosmos-guardrail==0.3.1"
+sglang serve \
+  --model-path nvidia/Cosmos3-Super \
+  --num-gpus 4
+```
+Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
+For the video-specialized checkpoint:
+```bash
+sglang serve \
+  --model-path nvidia/Cosmos3-Super-Image2Video \
+  --num-gpus 4
+```
+Supported SGLang endpoints:
+| Mode | Endpoint | Notes |
+| --- | --- | --- |
+| Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
+| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` until it completes, then download the result from `GET /v1/videos/{id}/content` |
+| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
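
The image-to-video row can be exercised in much the same way as a text-to-video request. The following is a minimal sketch rather than a confirmed recipe: it assumes the server listens on `localhost:30000`, that the response JSON carries an `id` field as it does for text-to-video jobs, and that the remaining generation parameters can be left at their defaults. `submit_i2v` is a hypothetical helper name. The image is sent with curl's `--form` instead of `--form-string`, because only `--form` treats `@path` as a file upload.

```bash
# Hypothetical helper: submit an image-to-video job and print its job id.
# Assumptions: server on localhost:30000, response JSON contains "id".
submit_i2v() {
  image_path="$1"
  prompt="$2"
  # Fail early if the conditioning image is missing, before any request is sent.
  if [ ! -f "$image_path" ]; then
    echo "conditioning image not found: $image_path" >&2
    return 1
  fi
  curl -sS -X POST http://localhost:30000/v1/videos \
    --form-string "prompt=${prompt}" \
    --form "input_reference=@${image_path}" \
    | python -c 'import json, sys; print(json.load(sys.stdin)["id"])'
}

# Usage (requires a running server):
#   job_id=$(submit_i2v ./first_frame.png "The robot lifts the blue box.")
```

The returned job id can then be polled and downloaded through the same `GET /v1/videos/{id}` and `/content` routes listed in the table.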
+Example text-to-video request:
+```bash
+job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
+  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
+  --form-string "negative_prompt=blurry, distorted, low quality" \
+  --form-string "size=1280x720" \
+  --form-string "num_frames=81" \
+  --form-string "fps=24" \
+  --form-string "num_inference_steps=35" \
+  --form-string "guidance_scale=4.0" \
+  --form-string "flow_shift=10.0" \
+  --form-string "seed=42" \
+  --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
+  | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
+while true; do
+  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
+    | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
+  [ "$status" = "completed" ] && break
+  [ "$status" = "failed" ] && exit 1
+  sleep 1
+done
+curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
+  -o cosmos3_super_t2v_output.mp4
+```
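
The polling loop above exits only when the job reports `completed` or `failed`, so a job that stays in any other state would be polled indefinitely. One way to bound the wait is sketched here. `poll_until_done` and `video_status` are hypothetical helper names, and the attempt limit and delay in the usage line are illustrative values, not SGLang recommendations.

```bash
# Hypothetical helper: run a status command repeatedly until it prints a
# terminal state. Returns 0 on "completed", 1 on "failed", 2 if the status
# command itself fails, and 3 if max_attempts is exhausted.
poll_until_done() {
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    job_status=$("$@") || return 2
    case "$job_status" in
      completed) return 0 ;;
      failed) return 1 ;;
    esac
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
  return 3
}

# Status lookup mirroring the request used in the example above.
video_status() {
  curl -sS "http://localhost:30000/v1/videos/$1" \
    | python -c 'import json, sys; print(json.load(sys.stdin)["status"])'
}

# Usage (requires a running server and a job id):
#   poll_until_done 600 1 video_status "$job_id" || echo "job did not complete" >&2
```

Passing the status command as arguments keeps the loop independent of curl, so the same helper can be exercised against a stub when no server is available.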
+Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
 ## Inference
 **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)

sound_tokenizer.ckpt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d
+size 1985246007
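
The three lines above are a Git LFS pointer: the repository stores only the SHA-256 object id and the byte size, while the checkpoint itself (1,985,246,007 bytes, roughly 1.85 GiB) is fetched separately. A downloaded file could be checked against such a pointer along these lines. `verify_lfs_object` is a hypothetical helper, and the sketch assumes GNU `sha256sum` is available.

```bash
# Hypothetical helper: compare a file against a Git LFS pointer file.
# Succeeds only if both the SHA-256 digest and the byte size match.
verify_lfs_object() {
  pointer_file="$1"
  object_file="$2"
  # Pull the expected digest and size out of the pointer's "oid" and "size" lines.
  expected_oid=$(sed -n 's/^oid sha256://p' "$pointer_file")
  expected_size=$(sed -n 's/^size //p' "$pointer_file")
  actual_oid=$(sha256sum "$object_file" | cut -d ' ' -f 1)
  actual_size=$(wc -c < "$object_file" | tr -d ' ')
  [ -n "$expected_oid" ] && [ "$expected_oid" = "$actual_oid" ] \
    && [ "$expected_size" = "$actual_size" ]
}

# Usage (file names are placeholders):
#   verify_lfs_object sound_tokenizer.pointer sound_tokenizer.ckpt && echo "checkpoint matches pointer"
```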

sound_tokenizer.json ADDED

	@@ -0,0 +1,42 @@

+{
+    "model_type": "autoencoder_v2",
+    "sampling_rate": 48000,
+    "stereo": true,
+    "use_wav_as_input": true,
+    "normalize_volume": true,
+    "hop_size": 1920,
+    "input_channels": 1,
+    "enc_type": "spec_convnext",
+    "enc_dim": 192,
+    "enc_intermediate_dim": 768,
+    "enc_num_layers": 12,
+    "enc_num_blocks": 2,
+    "enc_n_fft": 64,
+    "enc_hop_length": 16,
+    "enc_latent_dim": 128,
+    "enc_c_mults": [1, 2, 4],
+    "enc_strides": [4, 5, 6],
+    "enc_identity_init": false,
+    "enc_use_snake": true,
+    "dec_type": "oobleck",
+    "dec_dim": 320,
+    "dec_c_mults": [1, 2, 4, 8, 16],
+    "dec_strides": [2, 4, 5, 6, 8],
+    "dec_use_snake": true,
+    "dec_final_tanh": false,
+    "dec_out_channels": 2,
+    "dec_anti_aliasing": false,
+    "dec_use_nearest_upsample": false,
+    "dec_use_tanh_at_final": false,
+    "bottleneck_type": "vae",
+    "bottleneck": {"type": "vae"},
+    "activation": "snakebeta",
+    "snake_logscale": true,
+    "anti_aliasing": false,
+    "use_cuda_kernel": false,
+    "causal": false,
+    "padding_mode": "zeros",
+    "vocoder_input_dim": 64,
+    "latent_mean": null,
+    "latent_std": null
+}
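
A few of the values above can be cross-checked with simple arithmetic. At a 48 kHz sampling rate with a hop size of 1920 samples, each latent frame spans 1920 / 48000 = 40 ms, which gives 25 latent frames per second. The stride settings multiply out to the same factor of 1920. Reading `enc_hop_length` times the product of `enc_strides` as the encoder's total downsampling is an interpretation of the field names, not something the config states.

```bash
# Values copied from sound_tokenizer.json above.
sampling_rate=48000
hop_size=1920
enc_hop_length=16

# Products of enc_strides [4, 5, 6] and dec_strides [2, 4, 5, 6, 8].
enc_stride_product=$((4 * 5 * 6))
dec_stride_product=$((2 * 4 * 5 * 6 * 8))

# Assumed interpretation: STFT hop times the conv strides gives the total
# encoder downsampling, and the decoder strides give the total upsampling.
enc_downsampling=$((enc_hop_length * enc_stride_product))
latent_fps=$((sampling_rate / hop_size))

echo "latent frames per second: ${latent_fps}"
echo "encoder downsampling:     ${enc_downsampling}"
echo "decoder upsampling:       ${dec_stride_product}"
```

Under that reading, both the encoder and decoder factors equal `hop_size`, so one latent frame maps to 1920 audio samples in each direction.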