arxiv:2604.05117

Watch Before You Answer: Learning from Visually Grounded Post-Training

Published on Apr 6 · Submitted by Perry the Platypus on Apr 8

Abstract

Vision-language models struggle with video understanding partly because popular benchmarks and post-training datasets are biased toward text; VidGround addresses this by post-training only on visually grounded questions, which improves performance.

AI-generated summary

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
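The curation step at the heart of VidGround is to keep only questions that cannot be answered from text alone. A minimal sketch of that idea follows; it is not the paper's actual pipeline, and the blind-model call, rejection threshold, and field names are placeholders for whatever text-only evaluation you run.

```python
# Hypothetical sketch: keep a question only if a model that never sees the video
# fails to answer it reliably. `answer_without_video` is a stand-in for a real
# text-only (blind) model query; the threshold and sample count are assumptions.
import random


def answer_without_video(question: str, options: list[str]) -> str:
    """Placeholder for a blind model call that sees the question and options only."""
    return random.choice(options)  # replace with an actual model query


def is_visually_grounded(example: dict, n_samples: int = 5, max_blind_acc: float = 0.4) -> bool:
    """An example counts as visually grounded if the blind model rarely gets it right."""
    correct = sum(
        answer_without_video(example["question"], example["options"]) == example["answer"]
        for _ in range(n_samples)
    )
    return correct / n_samples <= max_blind_acc


def filter_post_training_data(dataset: list[dict]) -> list[dict]:
    """Return only the visually grounded subset for RL-based post-training."""
    return [ex for ex in dataset if is_visually_grounded(ex)]


# Toy usage with one multiple-choice item:
toy = [{
    "question": "What color is the car that enters the frame at 0:12?",
    "options": ["red", "blue", "green", "white"],
    "answer": "red",
}]
print(len(filter_post_training_data(toy)))
```

With the random placeholder, answers match only by chance, so most items are kept; with a real text-only model, the same filter would discard the text-answerable questions that the abstract reports make up 40-60% of common benchmarks.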

Community

Paper submitter

TL;DR: We find that 40-60% of questions in popular video understanding benchmarks (VideoMME, MMVU, etc.) can be answered from text alone, and the same problem also exists in post-training datasets. VidGround is a simple fix: keep only the visually grounded questions. Using only 69.1% of the data, it improves RL post-training by up to +6.2 points and beats several more complex post-training methods.

Takeaway: for video understanding in VLMs, visually grounded data matters more than data volume or algorithmic complexity.

Project page: http://vidground.etuagi.com

Get this paper in your agent:

hf papers read 2604.05117
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
