AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints
Abstract
AdaPlanBench presents a dynamic interactive benchmark for evaluating LLM agents' ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions.
Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
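The multi-turn protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the `Constraint` class, the `forbidden_step` violation check, the `run_episode` loop, the one-constraint-per-turn reveal policy, the turn limit, and the toy "heat water" task are all hypothetical names and assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Constraint:
    kind: str            # "world" or "user"
    description: str     # feedback text shown to the agent once revealed
    forbidden_step: str  # assumption: a plan violates the constraint if it contains this step


Plan = List[str]
Agent = Callable[[List[Constraint]], Plan]


def first_violation(plan: Plan, hidden: List[Constraint]) -> Optional[Constraint]:
    """Return the first hidden constraint that the plan violates, if any."""
    for constraint in hidden:
        if constraint.forbidden_step in plan:
            return constraint
    return None


def run_episode(agent: Agent, hidden: List[Constraint],
                max_turns: int = 10) -> Tuple[bool, int, List[Constraint]]:
    """Reveal a hidden constraint only when the proposed plan violates it."""
    revealed: List[Constraint] = []
    for turn in range(1, max_turns + 1):
        plan = agent(revealed)  # the agent sees only the accumulated feedback
        violated = first_violation(plan, hidden)
        if violated is None:
            return True, turn, revealed
        if violated not in revealed:
            revealed.append(violated)
    return False, max_turns, revealed


# Toy task: heat water. Candidate plans are listed in the agent's order of preference.
CANDIDATE_PLANS: List[Plan] = [
    ["fill cup", "use microwave"],
    ["fill pot", "use stove"],
    ["fill kettle", "use kettle"],
]


def toy_agent(revealed: List[Constraint]) -> Plan:
    """Pick the first candidate plan that avoids every revealed constraint."""
    banned = {c.forbidden_step for c in revealed}
    for plan in CANDIDATE_PLANS:
        if not banned.intersection(plan):
            return plan
    return CANDIDATE_PLANS[-1]  # fallback when every candidate is ruled out


HIDDEN = [
    Constraint("world", "The microwave is broken.", "use microwave"),
    Constraint("user", "The user does not want the stove used.", "use stove"),
]

if __name__ == "__main__":
    success, turns, revealed = run_episode(toy_agent, HIDDEN)
    print(success, turns, [c.kind for c in revealed])
```

In this toy run the agent needs one revision per hidden constraint, so the number of turns grows with the number of constraints it happens to violate. A real agent would have to infer the constraints from free-text feedback rather than match on exact step names.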
Community
Excited to share AdaPlanBench, a benchmark for studying how LLM agents adaptively re-plan as hidden world constraints and user preferences emerge.
the thing that sticks out most is the dual constraint construction pipeline and its three-round sampling to build e_low, e_mid, e_high. i’m curious how sensitive the reported adaptive planning performance is to the constraint distribution you generate; if world vs user constraints are not balanced across tasks, the scores might reflect the sampling bias more than the agent’s true adaptability. an ablation that varies the constraint sampling strategy or tests with fixed vs progressively disclosed constraints would help separate plan revision ability from constraint inference. btw the arxivlens breakdown helped me parse the method details, especially the constraint pipeline; it made me think about how to complement the evaluation with a stress test for skewed constraint mixes. https://arxivlens.com/PaperView/Details/adaplanbench-evaluating-adaptive-planning-in-large-language-model-agents-under-world-and-user-constraints-5253-a49a8bbf
Thanks for the thoughtful comment! I agree that constraint-distribution sensitivity is an important issue. In our current setup, we try to mitigate this in two ways: first, the generated world and user constraints are roughly balanced in average count across the low/mid/high profiles; second, the construction pipeline uses multiple planner samplers with iterative sampling and aggregation, rather than relying on a single sampler.
We also partially study this through the low/mid/high constraint-burden analysis and the world-only/user-only/both-constraints ablation. That said, I agree that an intentionally skewed constraint-mix stress test would be a very useful complementary evaluation. Thanks again for the careful reading!