arxiv:2605.22535

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Published on May 21

Submitted by taesiri on May 22


Authors:

Zhaoyang Chu ,

Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
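For readers unfamiliar with the correlation figure quoted in the abstract, the following is a minimal sketch of how a Pearson r between two benchmarks' per-system scores can be computed. The function name `pearson_r` and the score lists are hypothetical placeholders for illustration only; they are not the paper's data, and the paper's exact procedure may differ.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical pass rates (%) for the same four systems on two benchmarks.
# These numbers are made up for illustration and are NOT results from the paper.
benchmark_a = [10.0, 20.0, 30.0, 40.0]
benchmark_b = [30.0, 10.0, 10.0, 30.0]

r = pearson_r(benchmark_a, benchmark_b)
print(f"Pearson r = {r:.2f}")
```

A value of r near 0 means that a system's ranking on one benchmark says little about its ranking on the other, which is the sense in which the abstract describes the two benchmarks as measuring distinct capabilities.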


Community

taesiri

Paper submitter about 15 hours ago

TerminalWorld is a scalable data engine that reverse-engineers real-world terminal recordings into a benchmark of 1,530 validated tasks to evaluate agent performance on authentic software engineering terminal workflows.

chuzy

Paper author about 1 hour ago

Thanks for sharing our paper! 🙏

We built TerminalWorld to ask a practical question: Can terminal agents handle real-world human workflows?

Terminal-Bench is an important benchmark for terminal agents, while TerminalWorld takes a complementary path:
✍️ Manually authored tasks → Terminal-Bench
🔴 Real human terminal recordings → TerminalWorld

📊 From 80,870 public human terminal recordings, we reverse-engineer 1,530 validated terminal tasks, including a 200-task manually verified subset.

🧰 Each task includes a natural language instruction, a human reference solution, a Docker environment, and a test suite.

🌍 TerminalWorld covers 18 real-world terminal categories and 1,280 unique tools/commands, including containers, CI/CD, cloud infrastructure, system administration, environment setup, and software build/testing.

🧪 We evaluate frontier LLMs and terminal agents. Even the best model reaches only a 62.5% pass rate, showing that authentic terminal workflows remain challenging.

Feedback is very welcome! 🚀

Paper: https://arxiv.org/abs/2605.22535
Code: https://github.com/EuniAI/TerminalWorld
Dataset: https://huggingface.co/datasets/EuniAI/TerminalWorld
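As a rough illustration of the pass rate quoted above, here is a minimal sketch that assumes the pass rate is simply the fraction of tasks whose tests all pass in a single attempt. That definition, the `TaskResult` record, and the task identifiers are assumptions made for this example, not the paper's evaluation harness. Under that assumption, 62.5% on the 200-task Verified subset would correspond to 125 passed tasks.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one agent attempt on one task."""
    task_id: str   # illustrative identifier, not the dataset's real ID scheme
    passed: bool   # True if every test for the task passed


def pass_rate(results):
    """Percentage of tasks marked as passed."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.passed for r in results) / len(results)


# Synthetic results: 125 of 200 tasks passed (made-up outcomes for illustration).
results = [TaskResult(task_id=f"task-{i}", passed=(i < 125)) for i in range(200)]

print(f"{pass_rate(results):.1f}%")
```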


