FutureSim: Replaying World Events to Evaluate Adaptive Agents
Abstract
FutureSim enables evaluation of AI agents' long-term predictive capabilities by simulating chronological real-world event sequences, revealing significant gaps in current forecasting performance.
AI agents are increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To measure this capability efficiently for realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities: the best agent reaches an accuracy of 25%, and many agents have a worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions such as long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measuring AI progress on open-ended adaptation spanning long time horizons in the real world.
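To make the "worse than making no prediction at all" comparison concrete, here is a minimal sketch of a Brier skill score. The abstract does not define the metric or its baseline, so this sketch assumes binary questions and a reference forecast that always answers 0.5. The function names and the `reference_prob` parameter are hypothetical and may not match the paper's scoring.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equally long")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumption: the "no prediction" baseline is 0.5
) -> float:
    """Return 1 - BS / BS_ref; values below 0 mean the forecast loses to the baseline."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0.0:
        raise ValueError("reference forecast is perfect; skill score is undefined")
    return 1.0 - bs / bs_ref


# Hypothetical data: four resolved binary questions.
outcomes = [1, 0, 1, 0]
overconfident = [0.1, 0.9, 0.9, 0.1]  # confidently wrong on the first two questions
cautious = [0.8, 0.2, 0.7, 0.3]       # leans the right way on every question

print(brier_skill_score(overconfident, outcomes))  # negative: worse than the baseline
print(brier_skill_score(cautious, outcomes))       # positive: better than the baseline
```

Under these assumptions, a forecaster that is confidently wrong on half the questions scores below zero, which is the sense in which an agent can do worse than abstaining.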
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis (2026)
- World Reasoning Arena (2026)
- FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards (2026)
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory (2026)
- Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets (2026)
- Herculean: An Agentic Benchmark for Financial Intelligence (2026)
- Agentick: A Unified Benchmark for General Sequential Decision-Making Agents (2026)
