How to use NemoStation/Marlin-2B with Transformers:

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)
```
Does this model work by feeding it multiple sampled frames from a video, or a raw video file?
#7
by CT-Ati - opened
If it is based on multiple sampled frames,
could we have a preprocessing script or the optimal sampling rate, in case we want to apply this to a streaming input source?
BTW, I'm just assuming this based on the following info in the README:
Video preprocessing
The custom modeling code sets these env vars internally (matches the training-time setup). If you want to override them, set them in your shell before importing transformers:
| Env var | Default | What it does |
|---|---|---|
| FORCE_QWENVL_VIDEO_READER | torchcodec | Video decoder backend |
| VIDEO_MAX_PIXELS | 200704 | Max pixels per frame (~448×448) |
| FPS | 2.0 | Frame sampling rate |
| FPS_MAX_FRAMES | 240 | Cap on total frames (covers ~2 min videos) |
| FPS_MIN_FRAMES | 4 | Floor for very short videos |
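The override described in the README excerpt can also be sketched from Python rather than the shell. This is only a minimal sketch: `set_video_env` and `README_DEFAULTS` are hypothetical names, not part of the model's code, and it assumes the custom modeling code leaves already-set values alone, as the README wording implies.

```python
import os

# Defaults copied from the README table quoted above.
README_DEFAULTS = {
    "FORCE_QWENVL_VIDEO_READER": "torchcodec",
    "VIDEO_MAX_PIXELS": "200704",
    "FPS": "2.0",
    "FPS_MAX_FRAMES": "240",
    "FPS_MIN_FRAMES": "4",
}


def set_video_env(overrides=None):
    """Hypothetical helper: export the video env vars and return what was set."""
    overrides = overrides or {}
    unknown = set(overrides) - set(README_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown video env vars: {sorted(unknown)}")
    # Environment values must be strings, so stringify every override.
    effective = {**README_DEFAULTS, **{k: str(v) for k, v in overrides.items()}}
    os.environ.update(effective)
    return effective


# Call this before `import transformers`, mirroring the README's
# "set them in your shell before importing" advice.
effective = set_video_env({"FPS": 1.0, "FPS_MAX_FRAMES": 120})
```

Anything not overridden keeps its README default, so the call above only changes the sampling rate and the frame cap.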
The env var defaults listed above are the optimal values; the model was predominantly trained at 2 fps.
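For anyone who wants to see what those defaults imply for frame counts, here is a rough sketch of the arithmetic. `sample_frame_indices` is a hypothetical helper, not the model's actual preprocessing code, and the real video reader may round frame counts differently.

```python
def sample_frame_indices(total_frames, native_fps, fps=2.0,
                         min_frames=4, max_frames=240):
    """Pick evenly spaced frame indices, clamped to [min_frames, max_frames]."""
    if total_frames <= 0 or native_fps <= 0:
        raise ValueError("total_frames and native_fps must be positive")
    duration_s = total_frames / native_fps
    # Target count at the requested sampling rate, then apply floor and cap.
    target = round(duration_s * fps)
    target = max(min_frames, min(target, max_frames))
    # A clip cannot yield more frames than it contains.
    target = min(target, total_frames)
    if target == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids float rounding.
    return [(i * (total_frames - 1)) // (target - 1) for i in range(target)]


# 10 s clip at 30 fps: 10 s * 2 fps = 20 sampled frames.
indices = sample_frame_indices(total_frames=300, native_fps=30.0)
```

For a live stream with no known length, the same ratio suggests keeping roughly one frame every `native_fps / fps` frames (every 15th frame at 30 fps) and passing the model windows of at most 240 frames, though that windowing is an assumption rather than something the README states.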