arxiv:2601.16973

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Published on Jan 23 · Submitted by Zirui Wang on Jan 26
Abstract

AI-generated summary: Modern vision-language models exhibit significant challenges in multi-step visual interaction tasks, particularly in long-horizon perception-memory-action integration, with performance declining when handling unbounded historical contexts.

Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
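The interaction pattern the abstract describes (reset an environment, observe an image plus optional textual feedback, act, repeat up to a planning horizon) lends itself to a gym-style evaluation loop. Below is a minimal, hypothetical Python sketch of such a loop; `ToyEnv`, `Observation`, `TruncatedHistoryAgent`, and the action strings are illustrative stand-ins, not the actual VisGym API, which is documented in the code repository linked below.

```python
# Hypothetical sketch of a gym-style evaluation loop for a multimodal agent.
# All class and method names here are illustrative assumptions; consult
# https://github.com/visgym/VisGym for the real interface.

from dataclasses import dataclass


@dataclass
class Observation:
    image: bytes          # rendered frame (e.g., PNG bytes)
    feedback: str = ""    # optional textual feedback from the environment


class ToyEnv:
    """Stand-in environment: the episode succeeds once three steps are taken."""

    def __init__(self, horizon: int = 10):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Observation:
        self.t = 0
        return Observation(image=b"", feedback="episode start")

    def step(self, action: str) -> tuple[Observation, float, bool]:
        self.t += 1
        done = self.t >= 3 or self.t >= self.horizon
        reward = 1.0 if self.t >= 3 else 0.0
        return Observation(image=b"", feedback=f"step {self.t}"), reward, done


class TruncatedHistoryAgent:
    """Keeps only the last k observations before deciding on an action."""

    def __init__(self, k: int = 8):
        self.k = k
        self.history: list[Observation] = []

    def act(self, obs: Observation) -> str:
        self.history = (self.history + [obs])[-self.k:]
        # A real agent would prompt a VLM with the truncated history here.
        return "move"


def run_episode(env: ToyEnv, agent: TruncatedHistoryAgent, max_steps: int = 50) -> bool:
    """Roll out one episode; returns True if the final reward indicates success."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        if done:
            return reward > 0
    return False


if __name__ == "__main__":
    print("success:", run_episode(ToyEnv(), TruncatedHistoryAgent()))
```

The bounded history in `TruncatedHistoryAgent` mirrors the abstract's finding that models often perform better with a truncated context window than with an unbounded history.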

Community

Paper author and submitter:

We released VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents.
We systematically study the brittleness of vision-language models in multi-step visual interaction, analyze how training choices shape behavior, and open-source the full benchmark, models, and trajectories.

X: https://x.com/zwcolin/status/2015812327338287227
Project: https://visgym.github.io/
Paper: https://arxiv.org/abs/2601.16973
Code: https://github.com/visgym/VisGym
Data & models: https://huggingface.co/VisGym

