# Tempo-6B: Efficient Query-Aware Long Video Understanding
Tempo-6B is an efficient, query-aware Multimodal Large Language Model (MLLM) designed for understanding extremely long videos. It was presented in the paper Small Vision-Language Models are Smart Compressors for Long Video Understanding.
Tempo effectively resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor. It performs early cross-modal distillation, generating highly compact, intent-aligned video representations in a single forward pass.
## Architecture
Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
- Local Compressor: Qwen3-VL-2B-Instruct
- Global LLM: Qwen/Qwen3-4B
- Total Parameters: ~6B
## Key Features
- Adaptive Token Allocation (ATA): Acts as a training-free, O(1) dynamic router. It allocates dense representational bandwidth only to query-critical segments.
- Token Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
- Hour-Long Video Capability: Effectively processes and answers complex queries for videos over an hour long without hitting context limits.
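The budget-allocation idea behind ATA can be sketched as a simple relevance-to-budget mapping. This is a hypothetical illustration, not the official implementation: `allocate_tokens` and the linear mapping rule are assumptions; only the 0.5–16 tokens/frame range comes from the feature list above.

```python
# Hypothetical sketch of Adaptive Token Allocation (ATA): map per-segment
# query-relevance scores to a tokens-per-frame budget in the 0.5-16 range.
# The linear mapping is an assumption for illustration only.

def allocate_tokens(relevance, min_tpf=0.5, max_tpf=16.0):
    """Return one tokens-per-frame budget per video segment.

    Each budget depends only on the segment's score relative to the
    batch min/max, so the routing decision per segment is constant-time.
    """
    lo, hi = min(relevance), max(relevance)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores match
    return [min_tpf + (max_tpf - min_tpf) * (r - lo) / span
            for r in relevance]

budgets = allocate_tokens([0.1, 0.9, 0.5])
# least relevant segment gets 0.5 tokens/frame, most relevant gets 16.0
```

Under this scheme, a segment irrelevant to the query is compressed down to sub-token-per-frame density, while query-critical segments keep dense representations.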
## Quick Start
1. Installation
```bash
# Clone the repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install dependencies
pip install -r requirements.txt
```
2. Prepare Checkpoints
To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.
```bash
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```
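The same two checkpoints can also be fetched from Python with `huggingface_hub`, the library behind the CLI; the repo ids and target directories below mirror the commands above.

```python
# Python alternative to the huggingface-cli commands above.
REPOS = [
    ("Vision-CAIR/Tempo-6B", "./checkpoints/Tempo-6B"),
    ("Qwen/Qwen3-VL-2B-Instruct", "./checkpoints/Qwen3-VL-2B-Instruct"),
]

if __name__ == "__main__":
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in REPOS:
        # snapshot_download resumes interrupted downloads by default
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```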
3. Inference
Launch the Gradio web UI:

```bash
python app.py
```
CLI inference:

```bash
python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."
```
(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via transformers without the official codebase will not work out-of-the-box.)
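For scripting many videos, the same CLI call can be driven from Python. A minimal sketch, assuming only the `infer.py` flags shown above; `build_cmd` and `run_batch` are hypothetical helpers, not part of the repository.

```python
# Hypothetical batch driver around the infer.py CLI shown above.
import subprocess
from pathlib import Path

def build_cmd(model_path, video_path, query):
    """Assemble the infer.py argument list used in the Quick Start."""
    return ["python", "infer.py",
            "--model_path", model_path,
            "--video_path", str(video_path),
            "--query", query]

def run_batch(video_dir, model_path="./checkpoints/Tempo-6B",
              query="Describe the video in detail."):
    """Run inference once per .mp4 file in video_dir."""
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(build_cmd(model_path, video, query), check=True)
```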
## Performance
Tempo-6B achieves state-of-the-art performance on extremely long video tasks. On LVBench (average video length 4101s), Tempo-6B scores 52.3, outperforming proprietary baselines such as GPT-4o and Gemini 1.5 Pro.
## Citation
```bibtex
@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}
```