How to use NemoStation/Marlin-2B with Transformers:

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)
```
Can you use this model with image and text-only inputs apart from video?
The video capability is cool, but can you also perform inference with images or just text?
Also, would freezing the vision encoder let me train the LLM part of the model while keeping its ability to use the vision encoder?
No, image inference doesn't work: we trained Marlin-2B for video specifically.
Freezing the ViT alone wouldn't preserve image capability anyway. The dominant drift isn't in the encoder itself, it's in the merger between encoder and LLM. Even with a frozen ViT, the merger re-aligns to the changing LLM during fine-tuning, which is where most of the capability erosion comes from. Our own v0 SFT went the other way: a frozen ViT was *under-*trained, and video quality improved once we unfroze it with vit_lr=1e-4. If you want to preserve an upstream capability, you mix image data into the SFT; freezing alone isn't enough.
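For anyone experimenting with the frozen-versus-unfrozen setups described above, here is a minimal sketch of splitting parameters into ViT, merger, and LLM groups with separate learning rates. The name prefixes (`visual.`, `visual.merger.`) and the merger and LLM learning rates are assumptions for illustration, not Marlin-2B's actual layout or training recipe; only `vit_lr=1e-4` comes from the reply above. Check `model.named_parameters()` for the real names.

```python
from collections import defaultdict

# Hypothetical name prefixes; inspect model.named_parameters() for the real ones.
MERGER_PREFIX = "visual.merger."
VIT_PREFIX = "visual."


def component_of(param_name):
    """Map a parameter name to 'merger', 'vit', or 'llm' by its prefix."""
    # Check the merger first because its prefix also starts with the ViT prefix.
    if param_name.startswith(MERGER_PREFIX):
        return "merger"
    if param_name.startswith(VIT_PREFIX):
        return "vit"
    return "llm"


def build_param_groups(named_params, lrs):
    """Group (name, param) pairs by component and attach a learning rate.

    A component whose learning rate is None is left out of the result.
    Actually freezing it also requires setting requires_grad = False
    on the real tensors, which this sketch does not do.
    """
    buckets = defaultdict(list)
    for name, param in named_params:
        buckets[component_of(name)].append(param)
    return [
        {"name": comp, "params": params, "lr": lrs[comp]}
        for comp, params in buckets.items()
        if lrs.get(comp) is not None
    ]


# Stand-in (name, param) pairs; with a real model, pass model.named_parameters().
demo_params = [
    ("visual.blocks.0.attn.qkv.weight", "p0"),
    ("visual.merger.mlp.0.weight", "p1"),
    ("model.layers.0.self_attn.q_proj.weight", "p2"),
]

# vit lr is from the thread; the merger and llm values are illustrative only.
groups = build_param_groups(demo_params, {"vit": 1e-4, "merger": 1e-5, "llm": 1e-5})
for group in groups:
    print(group["name"], group["lr"], len(group["params"]))
```

The returned list of dicts follows the `params`/`lr` shape that PyTorch optimizers accept as per-parameter groups. Note that this only controls learning rates; per the reply above, preserving image capability would still need image data mixed into the SFT set.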
You could try converting an image into a video and giving it that. I don't know how it would behave, though.
Will definitely try that and share the results.
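The image-to-video idea suggested above could be sketched roughly as follows: repeat a single frame along a new time axis to get a static "clip". The frame count of 8 and the 224x224 placeholder image are arbitrary assumptions, and how the resulting array should be handed to the Marlin-2B processor depends on the model's remote code, which this sketch does not cover.

```python
import numpy as np


def image_to_clip(image, num_frames=8):
    """Repeat one H x W x C image along a new leading time axis."""
    if image.ndim != 3:
        raise ValueError("expected an H x W x C array")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # np.repeat copies the data, so the frames do not share memory.
    return np.repeat(image[np.newaxis, ...], num_frames, axis=0)


# Placeholder for a decoded image; a real one would come from an image loader.
image = np.zeros((224, 224, 3), dtype=np.uint8)
clip = image_to_clip(image, num_frames=8)
print(clip.shape)
```

Since the model was trained on real video, a clip with no motion is out of distribution, so the output quality is an open question until someone tests it.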