Cosmos 3 version?

#185

by nonetrix - opened 3 days ago

Just dropping this announcement here. If we're lucky, this could be a decent base for Anima 2 or some other model. I'm not an expert, so I'm not sure.

https://x.com/NVIDIAAI/status/2061308434629132553?s=20

https://huggingface.co/nvidia/Cosmos3-Super-Text2Image

https://huggingface.co/nvidia/Cosmos3-Nano

nonetrix changed discussion title from Cosmos 3 might be coming soon to Cosmos 3 version? 3 days ago

equal-l2

2 days ago

Cosmos3-Nano is 16B while Cosmos 2.5 is 2B, so it will be much more resource-intensive than the current Anima.

ItzPingCat

2 days ago

If you want something that big you might as well train off qwen or z image

Kimmypox

1 day ago

Bro, training a 16B model will straight up bankrupt whoever trains it.

zak01010101

1 day ago

If they make Anima 2.0 with this, they won't make it open source 🙏🥀

Iwaku-Real

about 22 hours ago

Cosmos3-Nano is 16B while Cosmos 2.5 is 2B, so it will be much more resource-intensive than the current Anima.

But also insanely better in everything else, because it has 8x the parameters. It could also run on RTX 40-series and newer GPUs in MXFP4, or on 50-series GPUs in NVFP4, providing huge speed gains with negligible quality loss.
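To put rough numbers on the 4-bit claim, here is a minimal sketch of how block-scaled 4-bit formats translate into weights-only storage. It assumes the published block layouts (MXFP4: 32 values sharing one 8-bit scale; NVFP4: 16 values sharing one 8-bit scale) and treats "16B" as a nominal 16e9 parameters. The helper names are made up for illustration. Real checkpoints typically keep some layers at higher precision and need extra memory for activations, so actual files will be larger than these estimates.

```python
# Weights-only storage estimate for block-scaled 4-bit formats.
# Assumptions: nominal parameter count, every weight quantized, no runtime overhead.

def effective_bits(element_bits: int, block_size: int, scale_bits: int) -> float:
    """Bits stored per weight once the shared per-block scale is amortised."""
    return element_bits + scale_bits / block_size

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

FORMATS = {
    "BF16": effective_bits(16, 1, 0),    # no shared scale
    "MXFP4": effective_bits(4, 32, 8),   # 32 weights share one 8-bit scale
    "NVFP4": effective_bits(4, 16, 8),   # 16 weights share one 8-bit scale
}

NANO_PARAMS = 16e9  # nominal "16B", not an exact count

for name, bits in FORMATS.items():
    print(f"{name}: {bits:.2f} bits/weight, ~{weight_gb(NANO_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit formats cost a little over 4 bits per weight, which is the lower bound on what a quantized 16B checkpoint could occupy.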

Iwaku-Real

about 22 hours ago

If you want something that big you might as well train off qwen or z image

The point of using Cosmos is that it's an excellent blank slate for spatial understanding. It was trained for physical AI such as robotics, so it knows a lot about where things should be in the output. That becomes really important when you're generating complex scenes where a character may be in a very dynamic pose, and it also applies to 'prompt bleed', where a character you prompt red eyes for unintentionally ends up with red hair. CircleStone doesn't need to focus as much on getting Anima to do those things right, because Cosmos already knew them in the first place.

Z-Image, Qwen, and even Flux.2 are poorly received even among those who can run them because of issues like poor prompt comprehension, plastic skin, ridiculous censorship, broken limbs, and what almost appears to be latent collapse (outputs were TOO consistent). Those bases will always be biased toward that behavior, and overriding it would be a lot more work than continuing on Cosmos.

Iwaku-Real

about 22 hours ago

Bro, training a 16B model will straight up bankrupt whoever trains it.

If they make Anima 2.0 with this, they won't make it open source 🙏🥀

Don't make assumptions about CircleStone. They started small with 2B for a reason. And it's a collab with Comfy Org too, which is the reason why it's had perfectly stable ComfyUI support since day zero. Comfy Org would absolutely not be working with CircleStone if they forced everyone to buy cloud credits just to run Anima.

Iwaku-Real

about 20 hours ago

I almost forgot – there is also a Cosmos 3 Edge 4B coming. That is something to look forward to...

equal-l2

about 18 hours ago

•

edited about 18 hours ago

'prompt bleed' where a character you prompt red eyes for could unintentionally have red hair

I'm not sure that simply scaling up or swapping the diffusion model alone would fix it. We might also need to deal with the text encoder, or perhaps even improve the dataset itself.

Kimmypox

about 16 hours ago

But also insanely better in everything else, because it has 8x the parameters. It could also run on RTX 40-series and newer GPUs in MXFP4, or on 50-series GPUs in NVFP4, providing huge speed gains with negligible quality loss.

Bro, keep the 16B model for yourself. Not everyone here has a high-end RTX 40 or 50 series GPU. And don't expect the community to help train it either. Not everyone has the money for cloud computing, and it's not like we all have top-tier server GPUs sitting next door, bro.

Iwaku-Real

about 16 hours ago

•

edited about 5 hours ago

Of course it's not as easy as scaling up. They will have to learn how to work with the architecture.
There are other huge things about it that I almost forgot about:

The input encoder is self-contained inside Cosmos 3. It uses Qwen3-VL and is either 2B, 8B, or 32B depending on the size of the Cosmos 3 base. (Yeah, over HALF the model is dedicated to processing your inputs.) That means they no longer need an llm_adapter network, which Anima 1.0 has needed because Cosmos 2 is based on T5 instead of Qwen3. However, they would still need to abliterate the encoder first for best results, since it needs to handle NSFW too.
Cosmos 3 natively supports ALL of these as inputs and outputs:

- Text
- Image
- Video (720p at 24 FPS for up to 400 frames, or 16.7 seconds)
- Audio
- Action (coordinate data representing movement in 3D)

Just imagine how much is possible with all of these!!!

Cosmos 3 Super 64B Text2Image has already topped the Artificial Analysis Text to Image Leaderboard among open-weight models and is barely trailing Nano Banana 2. I'd imagine Nano 16B ranks pretty high as well. That just goes to show how insane these models already are before further training.
Most importantly, there are actually THREE Cosmos 3 bases:

- Super 64B (~133GB BF16, ~65GB FP8, ~37GB NVFP4)
- Nano 16B (~35GB BF16, ~20.4GB FP8, ~14GB NVFP4)
- 🔜 Edge 4B (~9-11GB BF16)

Edge 4B in particular will be a HUGE starting point for CircleStone. While Nvidia seems to be delaying it (and they'd better be, so they can bake it more 😅), it will have an output part the same size as Anima's, but its encoder will be 2.5x bigger, and it will still be capable of the same input/output formats above. Sure, Wan 2.1 1.3B sucked, but it's been a year. Edge 4B will probably perform better than both that and Anima combined.

Thanks to its much smaller size, Edge 4B will be a lot easier to train and to run inference with. And because Nvidia sized these models in powers of 4 (4B, 16B, 64B), anyone could run Cosmos 3-based models at whichever size and quantization best fits their GPU(s). From a 4060 laptop sleeper to a B200 hyperscaler, there's something for everyone. And you won't have to be locked to cloud models. That is why this is certainly the future of image and video generation.
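As a rough illustration of "whatever best fits your GPU", the sketch below looks up which of the quoted checkpoints would fit in a given amount of VRAM. The sizes are the approximate figures listed in this post. The 20% headroom, the example VRAM values, and the function name are assumptions made for illustration. Edge 4B is unreleased and only a BF16 estimate is quoted, so its upper bound is used and no quantized Edge variants appear.

```python
# Approximate checkpoint sizes in GB, as quoted in the post.
QUOTED_SIZES_GB = {
    ("Super 64B", "BF16"): 133.0,
    ("Super 64B", "FP8"): 65.0,
    ("Super 64B", "NVFP4"): 37.0,
    ("Nano 16B", "BF16"): 35.0,
    ("Nano 16B", "FP8"): 20.4,
    ("Nano 16B", "NVFP4"): 14.0,
    ("Edge 4B", "BF16"): 11.0,  # upper end of the quoted ~9-11GB estimate
}

HEADROOM = 0.8  # hypothetical: keep 20% of VRAM free for activations and runtime

def variants_that_fit(vram_gb: float, headroom: float = HEADROOM):
    """Return (model, format, size_gb) tuples that fit the VRAM budget, largest first."""
    budget = vram_gb * headroom
    fits = [(m, f, s) for (m, f), s in QUOTED_SIZES_GB.items() if s <= budget]
    return sorted(fits, key=lambda item: item[2], reverse=True)

for vram in (16, 24, 48, 192):  # example VRAM sizes in GB, chosen for illustration
    fits = variants_that_fit(vram)
    label = f"{fits[0][0]} {fits[0][1]}" if fits else "nothing from the quoted list"
    print(f"{vram} GB -> largest quoted checkpoint that fits: {label}")
```

This only compares file sizes against a VRAM budget. It says nothing about speed or about whether a given GPU supports the format natively.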

nonetrix

about 8 hours ago

Cosmos3-Nano is 16B while Cosmos 2.5 is 2B, so it will be much more resource-intensive than the current Anima.

But also insanely better in everything else, because it has 8x the parameters. It could also run on RTX 40-series and newer GPUs in MXFP4, or on 50-series GPUs in NVFP4, providing huge speed gains with negligible quality loss.

I agree. I don't think it would be terrible to have a model on the bigger side for those who can afford it while still having smaller models. I'd also be fine if they hosted inference for it, as long as it's all still completely open source.

Iwaku-Real

about 3 hours ago

•

edited about 3 hours ago

For those of you who think Comfy Org is broke: they are NOWHERE near being broke. They raised over $48 million in venture capital, and they're granting $1 million to support open-source AI development, including that of Anima. That grant alone is equivalent to renting 8x B300 288GB GPUs for a year straight. So CircleStone is getting plenty of support. They have more than enough compute potential to get started on training from something like Cosmos 3 Edge 4B (when it's out), and they have massive incentive to do so for the reasons I stated previously.
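For anyone who wants to check the grant comparison, here is a small sketch of the arithmetic behind it. The grant size and GPU count come from the post. The price per GPU-hour is not stated there, so the code only derives the rate the comparison implies and lets you plug in your own rate (the helper name is hypothetical).

```python
GRANT_USD = 1_000_000      # grant size stated in the post
GPUS = 8                   # "8x B300" from the post
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# Total GPU-hours in "8 GPUs for a year straight".
gpu_hours = GPUS * HOURS_PER_YEAR

# Rental price per GPU-hour that would make the comparison exact.
implied_rate = GRANT_USD / gpu_hours

def gpu_hours_at(rate_usd_per_gpu_hour: float, budget_usd: float = GRANT_USD) -> float:
    """GPU-hours a budget buys at a given hourly rate (rate is the caller's assumption)."""
    return budget_usd / rate_usd_per_gpu_hour

print(f"{gpu_hours} GPU-hours, implied rate ${implied_rate:.2f} per GPU-hour")
```

If actual rental prices are lower than the implied rate, the same grant would cover more than 8 GPUs for a year, and fewer if they are higher.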

kdutt2000

about 3 hours ago

Cosmos3-Nano is 16B while Cosmos 2.5 is 2B, so it will be much more resource-intensive than the current Anima.

They have a Cosmos-Predict2.5-2B base distilled extracted DMD2 LoRA:

https://civitai.com/models/2466415/cosmos-predict25-2b-base-distilled-extracted-dmd2-lora
