# Tempo-6B: Efficient Query-Aware Long Video Understanding
Tempo-6B is an efficient, query-aware Multimodal Large Language Model (MLLM) designed for understanding extremely long videos. It was presented in the paper Small Vision-Language Models are Smart Compressors for Long Video Understanding.
Tempo effectively resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor. It performs early cross-modal distillation, generating highly compact, intent-aligned video representations in a single forward pass.
## Architecture
Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
- Local Compressor: Qwen3-VL-2B-Instruct
- Global LLM: Qwen/Qwen3-4B
- Total Parameters: ~6B
## Key Features
- Adaptive Token Allocation (ATA): Acts as a training-free, O(1) dynamic router. It allocates dense representational bandwidth only to query-critical segments.
- Token Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
- Hour-Long Video Capability: Effectively processes and answers complex queries for videos over an hour long without hitting context limits.
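The budget-allocation idea behind ATA can be sketched as a simple relevance-to-budget mapping. This is a hypothetical illustration, not the official implementation: `allocate_tokens` and the linear mapping rule are assumptions; only the 0.5–16 tokens/frame range comes from the feature list above.

```python
# Hypothetical sketch of Adaptive Token Allocation (ATA): map per-segment
# query-relevance scores to a tokens-per-frame budget in the 0.5-16 range.
# The linear mapping is an assumption for illustration only.

def allocate_tokens(relevance, min_tpf=0.5, max_tpf=16.0):
    """Return one tokens-per-frame budget per video segment.

    Each budget depends only on the segment's score relative to the
    batch min/max, so the routing decision per segment is constant-time.
    """
    lo, hi = min(relevance), max(relevance)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores match
    return [min_tpf + (max_tpf - min_tpf) * (r - lo) / span
            for r in relevance]

budgets = allocate_tokens([0.1, 0.9, 0.5])
# least relevant segment gets 0.5 tokens/frame, most relevant gets 16.0
```

Under this scheme, a segment irrelevant to the query is compressed down to sub-token-per-frame density, while query-critical segments keep dense representations.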
## Quick Start
1. Installation
```bash
# Clone the repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install dependencies
pip install -r requirements.txt
```
2. Prepare Checkpoints
To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.
```bash
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```
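The same two checkpoints can also be fetched from Python with `huggingface_hub`, the library behind the CLI; the repo ids and target directories below mirror the commands above.

```python
# Python alternative to the huggingface-cli commands above.
REPOS = [
    ("Vision-CAIR/Tempo-6B", "./checkpoints/Tempo-6B"),
    ("Qwen/Qwen3-VL-2B-Instruct", "./checkpoints/Qwen3-VL-2B-Instruct"),
]

if __name__ == "__main__":
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in REPOS:
        # snapshot_download resumes interrupted downloads by default
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```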
3. Inference
Launch the Gradio web UI:

```bash
python app.py
```
CLI inference:

```bash
python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."
```
(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via transformers without the official codebase will not work out-of-the-box.)
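For scripting many videos, the same CLI call can be driven from Python. A minimal sketch, assuming only the `infer.py` flags shown above; `build_cmd` and `run_batch` are hypothetical helpers, not part of the repository.

```python
# Hypothetical batch driver around the infer.py CLI shown above.
import subprocess
from pathlib import Path

def build_cmd(model_path, video_path, query):
    """Assemble the infer.py argument list used in the Quick Start."""
    return ["python", "infer.py",
            "--model_path", model_path,
            "--video_path", str(video_path),
            "--query", query]

def run_batch(video_dir, model_path="./checkpoints/Tempo-6B",
              query="Describe the video in detail."):
    """Run inference once per .mp4 file in video_dir."""
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(build_cmd(model_path, video, query), check=True)
```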
## Performance
Tempo-6B achieves state-of-the-art performance on extremely long video tasks. On LVBench (average video length 4101s), Tempo-6B scores 52.3, outperforming proprietary baselines such as GPT-4o and Gemini 1.5 Pro.
## Citation
```bibtex
@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}
```