🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding


Tempo-6B is an efficient, query-aware Multimodal Large Language Model (MLLM) designed specifically for extremely long video understanding. It was presented in the paper Small Vision-Language Models are Smart Compressors for Long Video Understanding.

Tempo resolves the structural mismatch between massive video streams and the bounded context windows of LLMs by acting as a smart temporal compressor. It performs early cross-modal distillation, producing compact, query-aligned video representations in a single forward pass.
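To make the idea concrete, here is a toy sketch of query-aware compression (illustrative names and logic only; this is not Tempo's actual code or API): each frame yields many visual tokens, and a small compressor keeps only the few most relevant to the query before anything reaches the LLM, so the global context stays bounded.

```python
# Toy illustration of query-aware temporal compression (NOT Tempo's real code).
# Each frame produces many visual tokens; a small compressor keeps only the
# tokens most relevant to the query, so the LLM sees a bounded context.

def compress_frame(frame_tokens, query_terms, keep=4):
    """Keep at most `keep` tokens per frame, preferring query-relevant ones."""
    relevant = [t for t in frame_tokens if t in query_terms]
    rest = [t for t in frame_tokens if t not in query_terms]
    return (relevant + rest)[:keep]

# Two frames' worth of (symbolic) visual tokens and a query about a crash.
video = [["car", "road", "sky", "tree", "cloud"],
         ["car", "crash", "smoke", "road", "sign"]]
query = {"crash", "smoke"}
compact = [compress_frame(f, query) for f in video]
```

In the real model the "tokens" are continuous visual embeddings and the relevance decision is learned, but the shape of the pipeline is the same: compress locally per segment, then hand a short sequence to the global LLM.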

πŸ—οΈ Architecture

Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).

  • Local Compressor: Qwen/Qwen3-VL-2B-Instruct
  • Global LLM: Qwen/Qwen3-4B
  • Total Parameters: ~6B

✨ Key Features

  • Adaptive Token Allocation (ATA): Acts as a training-free, O(1) dynamic router. It allocates dense representational bandwidth only to query-critical segments.
  • Token Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
  • Hour-Long Video Capability: Effectively processes and answers complex queries for videos over an hour long without hitting context limits.
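As a rough, hypothetical illustration of the ATA idea (not the official implementation), a training-free router could map each segment's query-relevance score to a per-frame token budget inside the 0.5–16 tokens/frame range, giving dense bandwidth only to query-critical segments:

```python
# Hypothetical sketch of query-aware token budgeting (NOT the official ATA code).
# Given a relevance score in [0, 1] per video segment, map it linearly to a
# per-frame token budget between 0.5 and 16 tokens/frame: query-critical
# segments get dense representations, redundant ones are compressed hard.

MIN_TOKENS = 0.5   # tokens/frame for irrelevant segments
MAX_TOKENS = 16.0  # tokens/frame for query-critical segments

def allocate_budget(relevance_scores):
    """Map per-segment relevance scores to per-frame token budgets."""
    return [MIN_TOKENS + r * (MAX_TOKENS - MIN_TOKENS) for r in relevance_scores]

budgets = allocate_budget([0.0, 0.25, 1.0])
```

Each segment's budget is computed independently from its own score, which is what makes this kind of router constant-time per segment and usable without any extra training.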

🚀 Quick Start

1. Installation

```bash
# Clone the repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install dependencies
pip install -r requirements.txt
```

2. Prepare Checkpoints

To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.

```bash
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```

3. Inference

Launch Gradio Web UI:

```bash
python app.py
```

CLI Inference:

```bash
python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."
```

(Note: Tempo relies on custom routing code, so loading the weights directly with transformers, without the official codebase, will not work out of the box.)

πŸ† Performance

Tempo-6B achieves state-of-the-art performance on extremely long video tasks. On LVBench (average video length 4,101 seconds), Tempo-6B scores 52.3, outperforming proprietary baselines such as GPT-4o and Gemini 1.5 Pro.

📑 Citation

```bibtex
@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}
```