Can you use this model with image and text-only inputs apart from video?

by lunahr - opened 13 days ago

The video capability is cool, but can you also perform inference with images or just text?

Also, would freezing the vision encoder let me train the LLM part of the model while keeping its ability to use the vision encoder?

rethinkNow

Nemo Station org 8 days ago

No, image inference doesn't work: we trained Marlin-2B specifically for video.
Freezing the ViT alone wouldn't preserve image capability anyway. The dominant drift isn't in the encoder itself but in the merger between the encoder and the LLM. Even with a frozen ViT, the merger re-aligns to the changing LLM during fine-tuning, and that is where most of the capability erosion comes from. Our own v0 SFT went the other way: the frozen ViT was under-trained, and video quality improved once we unfroze it with vit_lr=1e-4. If you want to preserve an upstream capability, mix image data into the SFT; freezing alone isn't enough.
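For anyone who wants to try this on their own fine-tune, here is a rough sketch of the two ideas above: separate learning rates for the ViT, merger, and LLM, plus mixing image samples into the SFT set. Only vit_lr=1e-4 comes from our run. The parameter-name prefixes, the other learning rates, the 20% image fraction, and both function names are assumptions for illustration, so check `named_parameters()` on your own checkpoint before relying on them.

```python
import random

# Hypothetical name prefixes; inspect your model's named_parameters() for the real ones.
MERGER_PREFIX = "visual.merger."
VIT_PREFIX = "visual."


def build_param_groups(named_params, vit_lr=1e-4, merger_lr=1e-5, llm_lr=1e-5):
    """Split (name, param) pairs into ViT / merger / LLM groups with separate learning rates.

    merger_lr and llm_lr are placeholder values, not settings from the thread.
    """
    groups = {
        "vit": {"params": [], "lr": vit_lr},
        "merger": {"params": [], "lr": merger_lr},
        "llm": {"params": [], "lr": llm_lr},
    }
    for name, param in named_params:
        # Check the merger first, because its prefix is nested under the ViT prefix.
        if name.startswith(MERGER_PREFIX):
            groups["merger"]["params"].append(param)
        elif name.startswith(VIT_PREFIX):
            groups["vit"]["params"].append(param)
        else:
            groups["llm"]["params"].append(param)
    # Drop groups that matched no parameters.
    return {key: group for key, group in groups.items() if group["params"]}


def mix_samples(video_samples, image_samples, image_fraction=0.2, seed=0):
    """Return a shuffled SFT list in which roughly image_fraction of items are image samples."""
    if not 0.0 <= image_fraction < 1.0:
        raise ValueError("image_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of image samples needed so that images make up image_fraction of the mix.
    n_images = round(len(video_samples) * image_fraction / (1.0 - image_fraction))
    picked = [rng.choice(image_samples) for _ in range(n_images)]  # sampled with replacement
    mixed = list(video_samples) + picked
    rng.shuffle(mixed)
    return mixed
```

With PyTorch, the groups could then be passed to an optimizer as `torch.optim.AdamW(list(groups.values()))`, since each group is a plain dict with `params` and `lr` keys.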

FlameF0X

8 days ago

You could try converting an image into a video and feeding it that. I don't know how it would behave, though.
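Something like this is what I mean: repeat the single image as every frame of a short static clip. This is only a sketch. The function name, the frame count, and the fps are made up, and I don't know what input format the model's processor actually expects, so the frame list would still need to be adapted to that.

```python
def image_to_clip(frame, num_frames=8, fps=2.0):
    """Repeat one still frame to form a static clip.

    Returns the list of frames and a matching list of timestamps in seconds.
    num_frames and fps are arbitrary illustrative defaults.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if fps <= 0:
        raise ValueError("fps must be positive")
    frames = [frame for _ in range(num_frames)]         # same image for every frame
    timestamps = [i / fps for i in range(num_frames)]   # evenly spaced frame times
    return frames, timestamps
```

`frame` can be whatever image object you already have (a PIL image, an array, and so on); the function never inspects it.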

rethinkNow

Nemo Station org 8 days ago

We'll certainly try that and share the results.
