Inference speed
#8
by tintwotin - opened
Using Find on a 2-minute 1920x832 video takes 459.15 s on an RTX 4090. Can anything be done to speed it up, like downscaling the video beforehand? Or is a turbo version planned?
459s for a 2-min 1920×832 clip is on the slow end but expected at that resolution. Two things you can try:
- Pre-downscale the video. 1920×832 is roughly 8× over the model's per-frame pixel budget (we cap at ~200K pixels via smart_resize internally). The internal resize handles it, but at decode cost. Downscaling to ~640×270 before sending to the model cuts the visual-encoder time substantially without hurting accuracy for grounding-style queries.
- Quantise the weights. On a 4090, AWQ-quantised weights + bf16 KV-cache typically give 3-4× throughput vs vanilla bf16. We haven't shipped a quantised checkpoint ourselves yet, but you can do this in a half-hour with llm-compressor or AutoAWQ. If you do, we'd be curious what mIoU you get on TimeLens-Bench to compare against our bf16 numbers.
No "turbo" variant is planned. The model is already 2B params, so the realistic speedup path is inference-side, not architectural.
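For anyone who wants to try the pre-downscale suggestion, here is a minimal sketch of picking a target size under the ~200K-pixel budget mentioned above and building an ffmpeg command for it. The helper names (`fit_to_pixel_budget`, `build_ffmpeg_cmd`) and the file names are hypothetical, the rounding to even dimensions is an assumption made for encoder compatibility, and the model's internal smart_resize may round differently. It assumes ffmpeg is installed and only constructs the command; it does not run it.

```python
import math
import shlex


def fit_to_pixel_budget(width, height, max_pixels=200_000, multiple=2):
    """Shrink (width, height) so width * height <= max_pixels.

    Aspect ratio is approximately preserved. Both sides are rounded down
    to a multiple of `multiple` (an assumption; many encoders need even sizes).
    """
    if width * height <= max_pixels:
        scale = 1.0
    else:
        scale = math.sqrt(max_pixels / (width * height))
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    return new_w, new_h


def build_ffmpeg_cmd(src, dst, width, height):
    """Return an ffmpeg argument list that rescales src to width x height."""
    return ["ffmpeg", "-y", "-i", src, "-vf", f"scale={width}:{height}", dst]


if __name__ == "__main__":
    w, h = fit_to_pixel_budget(1920, 832)
    # Ratio of the original frame area to the budget (the "roughly 8x" figure).
    print(f"over budget by {1920 * 832 / 200_000:.1f}x, target {w}x{h}")
    # Hypothetical file names; pass the list to subprocess.run(...) to execute.
    print(shlex.join(build_ffmpeg_cmd("input.mp4", "small.mp4", w, h)))
```

The sketch targets the budget directly, so it lands slightly above the ~640×270 suggested earlier; either size stays under 200K pixels per frame.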
I tried downscaling and it didn't help. I would like to add it to my Pallaidium AI add-on for Blender, but currently it is simply too slow for me. Will check in later to see if something has improved. Thank you.