arxiv:2606.04433

Stateful Visual Encoders for Vision-Language Models

Published on Jun 3

Submitted by Zirui Wang on Jun 3

Voio Inc

Authors:

Junwei Yu

Abstract

Stateful visual encoders condition visual representations on prior features, improving visual comparison tasks in vision-language models.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
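The contrast between a stateless and a stateful encoder can be illustrated with a toy sketch. This is not the paper's architecture: the function names (`encode_stateless`, `cross_attend`, `encode_stateful`), the `gate` parameter, and the choice of unprojected dot-product cross-attention with a residual add are all assumptions made for illustration. It only shows the general idea that the representation of a later image can depend on features from earlier images.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def encode_stateless(patches):
    """Toy stand-in for a per-image encoder: returns a copy of the patch features."""
    return [list(p) for p in patches]


def cross_attend(queries, memory):
    """Single-head scaled dot-product attention with no learned projections (assumption)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, memory)) for d in range(dim)])
    return out


def encode_stateful(images, gate=0.5):
    """Encode images in order; each image attends to features of all earlier images.

    `gate` is a hypothetical fixed mixing weight for the attended context.
    """
    memory = []   # features of previously encoded images
    outputs = []
    for patches in images:
        feats = encode_stateless(patches)
        if memory:
            context = cross_attend(feats, memory)
            # Residual add: current features plus gated context from prior images.
            feats = [[f + gate * c for f, c in zip(fr, cr)]
                     for fr, cr in zip(feats, context)]
        outputs.append(feats)
        memory.extend(feats)
    return outputs


if __name__ == "__main__":
    earlier_a = [[1.0, 0.0], [0.0, 1.0]]
    earlier_b = [[1.0, 0.0], [0.0, 2.0]]   # differs from earlier_a in one patch
    current = [[0.5, 0.5], [1.0, 1.0]]
    print(encode_stateful([earlier_a, current])[1])
    print(encode_stateful([earlier_b, current])[1])
```

In this sketch the first image is encoded exactly as the stateless encoder would encode it, while the second image's features shift depending on which earlier image preceded it. A stateless encoder would return the same features for `current` in both cases.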

Community

zwcolin

Paper submitter about 19 hours ago

👀Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM.

We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware.

🌐Project: https://statefulvisualencoders.github.io
📰Paper: https://arxiv.org/abs/2606.04433
💻Code: https://github.com/StatefulVisualEncoders/StatefulVisualEncoders
