CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability Paper • 2602.03012 • Published 15 days ago • 2
V_0: A Generalist Value Model for Any Policy at State Zero Paper • 2602.03584 • Published 14 days ago • 21
CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs Paper • 2602.03048 • Published 15 days ago • 33
ToolACE-MCP: Generalizing History-Aware Routing from MCP Tools to the Agent Web Paper • 2601.08276 • Published Jan 13 • 7
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences Paper • 2601.06789 • Published Jan 11 • 78
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction Paper • 2601.06966 • Published Jan 11 • 9
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning Paper • 2601.06943 • Published Jan 11 • 212
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle Paper • 2512.04324 • Published Dec 3, 2025 • 154
InteractComp: Evaluating Search Agents With Ambiguous Queries Paper • 2510.24668 • Published Oct 28, 2025 • 98
ReCode: Unify Plan and Action for Universal Granularity Control Paper • 2510.23564 • Published Oct 27, 2025 • 122
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries Paper • 2508.15760 • Published Aug 21, 2025 • 47 • 9