GameDevBench: Evaluating Agentic Capabilities Through Game Development
Abstract
GameDevBench is introduced as the first benchmark for evaluating agents on game development tasks that combine software development complexity with deep multimodal understanding requirements.
Despite rapid progress on coding agents, their multimodal counterparts have lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed: agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks are complex and demand significant multimodal understanding: the average solution requires over three times as many lines of code and file changes as solutions in prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these mechanisms consistently improve performance, with the largest gain lifting Claude Sonnet 4.5 from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
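The abstract describes the image- and video-based feedback mechanisms only at a high level, so the sketch below is an assumption about what such a loop could look like, not the paper's actual implementation: after each agent edit, the game scene is rendered to a screenshot and handed back to the agent as an additional multimodal observation. All names here (`render_frame`, `agent_step`, `game_cmd`) are illustrative placeholders, not part of GameDevBench's released tooling.

```python
"""Minimal sketch of an image-feedback loop for a game-development agent.
Hypothetical: the actual mechanisms in GameDevBench are not specified here."""
import base64
import subprocess
from pathlib import Path
from typing import Callable

def render_frame(game_cmd: list[str], out_path: Path, timeout_s: int = 60) -> Path:
    """Run the project's build/render command, assumed to write a frame to out_path.
    The exact command is engine-specific; this is a placeholder invocation."""
    subprocess.run(game_cmd, check=True, timeout=timeout_s)
    return out_path

def to_data_url(image_path: Path) -> str:
    """Encode a PNG screenshot as a data URL for a multimodal chat message."""
    b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return f"data:image/png;base64,{b64}"

def image_feedback_loop(
    agent_step: Callable[[str, str | None], str],  # (task, image_url) -> "done" or next action
    task: str,
    game_cmd: list[str],
    frame_path: Path,
    max_iters: int = 5,
) -> None:
    """Alternate between an agent edit and a rendered-screenshot observation."""
    image_url: str | None = None
    for _ in range(max_iters):
        if agent_step(task, image_url) == "done":
            break
        image_url = to_data_url(render_frame(game_cmd, frame_path))
```

A video-based variant would follow the same pattern, capturing a short gameplay clip instead of a single frame before returning it to the agent.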
Community
Can agents develop video games?
GameDevBench is the first benchmark to evaluate an agent's ability to solve game development tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- VirtualEnv: A Platform for Embodied AI Research (2026)
- ProxyWar: Dynamic Assessment of LLM Code Generation in Game Arenas (2026)
- TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents (2026)
- Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image (2025)
- AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios (2026)
- Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents (2026)
- From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models (2026)