AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
Abstract
AgentProcessBench is a benchmark for evaluating step-level effectiveness in tool-augmented agent interactions, featuring diverse trajectories with detailed human step annotations to support process-level understanding and improve model performance.
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
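To make the ternary labeling scheme and error propagation rule concrete, here is a minimal Python sketch. The label names and the exact propagation behavior are assumptions for illustration only; the authors' precise rule is defined in the released data, not here.

```python
# Minimal sketch of a ternary step-labeling scheme with error propagation.
# Label names and the propagation behavior are assumptions for illustration.
from enum import Enum

class StepLabel(Enum):
    CORRECT = "correct"  # step makes progress on the task
    NEUTRAL = "neutral"  # exploratory step: neither helps nor harms
    ERROR = "error"      # step introduces a (possibly irreversible) fault

def propagate_errors(labels: list[StepLabel]) -> list[StepLabel]:
    """One plausible reading of an error propagation rule: once a step is
    erroneous, later steps that merely continue the faulty branch cannot
    be scored as correct, which reduces ambiguity for annotators."""
    propagated, poisoned = [], False
    for label in labels:
        if poisoned and label == StepLabel.CORRECT:
            propagated.append(StepLabel.NEUTRAL)  # demote: built on an error
        else:
            propagated.append(label)
        poisoned = poisoned or label == StepLabel.ERROR
    return propagated

# The third step is demoted because it follows an erroneous action.
print(propagate_errors([StepLabel.CORRECT, StepLabel.ERROR, StepLabel.CORRECT]))
```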
Community
AgentProcessBench Available Now
When using Process Reward Models (PRMs) to guide reinforcement learning (RL) training, accurately identifying each step's impact or contribution within a trajectory is essential for providing precise reward signals. To enable a more rigorous and comprehensive evaluation of models' capabilities as PRMs, we have developed a PRM evaluation benchmark designed specifically for tool-using agents. The benchmark comprises 1,000 trajectories totaling 8,509 steps, all with 100% human-annotated labels. Our goal is to provide a fine-grained testing platform for PRM research in agent-based scenarios.
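As a rough illustration of what step-level PRM evaluation looks like, the sketch below scores a PRM's per-step judgments against human ternary labels. The trajectory fields (`steps`, `action`, `human_label`) and the `prm_predict` interface are hypothetical stand-ins, not the benchmark's actual released format.

```python
# Hedged sketch of step-level PRM evaluation against human ternary labels.
# Field names and the prm_predict interface are hypothetical placeholders.
from collections import Counter

def step_accuracy(trajectories, prm_predict):
    """Fraction of steps where the PRM's judgment matches the human label."""
    counts = Counter()
    for traj in trajectories:
        for step in traj["steps"]:
            # prm_predict returns one of "correct" / "neutral" / "error"
            pred = prm_predict(traj["task"], step["action"])
            counts["total"] += 1
            counts["hit"] += pred == step["human_label"]
    return counts["hit"] / counts["total"]
```

Per-label metrics (e.g., recall on neutral vs. error steps) follow the same loop and are what expose the neutral/error confusion the paper highlights.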
Homepage
rucbm.github.io/AgentProcessBench-Homepage/
HF
huggingface.co/datasets/LulaCola/AgentProcessBench
GitHub
github.com/RUCBM/AgentProcessBench
arXiv
arxiv.org/abs/2603.14465
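For quick inspection, the dataset linked above can presumably be pulled with the standard `datasets` API; the split name and the shape of each record are assumptions.

```python
# Minimal loading sketch; the split name is an assumption.
from datasets import load_dataset

ds = load_dataset("LulaCola/AgentProcessBench", split="train")
print(ds[0])  # inspect one annotated trajectory
```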
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- AgentRx: Diagnosing AI Agent Failures from Execution Trajectories (2026)
- Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents (2026)
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context (2026)
- TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios (2026)
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG (2026)
- LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios (2026)
- PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering (2026)