Watch Before You Answer: Learning from Visually Grounded Post-Training
Abstract
Vision-language models struggle with video understanding in part because benchmarks and post-training datasets carry text-based biases; VidGround addresses this by post-training only on visually grounded questions, improving performance.
It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: in commonly reported long video understanding benchmarks, 40-60% of questions can be answered using text cues alone. We further find that the same issues pervade widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding. Guided by this observation, we introduce VidGround as a simple yet effective solution: post-training only on questions that are genuinely visually grounded and free of linguistic shortcuts. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation paired with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
Community
TL;DR: We find that 40-60% of questions in popular video understanding benchmarks (VideoMME, MMVU, etc.) can be answered from text alone, and the same problem also exists in post-training datasets. VidGround is a simple fix: keep only the visually grounded questions. Using only 69.1% of the data, it improves RL post-training by up to +6.2 points and beats several more complex post-training methods.
Takeaway: for video understanding in VLMs, visually grounded data matters more than data volume or algorithmic complexity.
Project page: http://vidground.etuagi.com
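Below is a minimal sketch of the filtering idea, under the assumption that "visually grounded" is operationalized by probing each multiple-choice question with a text-only model and keeping only the items it cannot solve without the video. The `VideoQA` container and the caller-supplied `answer_fn` (a wrapper around any text-only LLM returning a predicted option letter) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VideoQA:
    question: str
    options: List[str]   # e.g. ["A. ...", "B. ...", "C. ...", "D. ..."]
    answer: str          # gold option letter, e.g. "B"
    video_path: str      # never shown to the text-only probe


def is_text_answerable(item: VideoQA,
                       answer_fn: Callable[[str], str],
                       n_trials: int = 3) -> bool:
    """True if a blind (text-only) model answers correctly in a majority of
    trials, i.e. the question leaks its answer through text cues alone."""
    prompt = item.question + "\n" + "\n".join(item.options)
    hits = sum(
        answer_fn(prompt).strip().upper().startswith(item.answer.upper())
        for _ in range(n_trials)
    )
    return hits > n_trials // 2


def filter_visually_grounded(data: List[VideoQA],
                             answer_fn: Callable[[str], str]) -> List[VideoQA]:
    """Keep only items the text-only probe fails on; these are the visually
    grounded questions retained for post-training."""
    return [item for item in data if not is_text_answerable(item, answer_fn)]
```

The same blind probe can also be run over an evaluation set to estimate what fraction of a benchmark is answerable from text alone, in the spirit of the 40-60% figure reported in the abstract.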
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos (2026)
- Selective Training for Large Vision Language Models via Visual Information Gain (2026)
- PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues (2026)
- Incentivizing Temporal-Awareness in Egocentric Video Understanding Models (2026)
- VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning (2026)
- Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification (2026)
- Video-Oasis: Rethinking Evaluation of Video Understanding (2026)