MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild Paper • 2603.17187 • Published 7 days ago • 127
WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics Paper • 2603.13391 • Published 14 days ago • 19
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? Paper • 2603.03241 • Published 21 days ago • 86
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale Paper • 2602.23866 • Published 26 days ago • 88
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning Paper • 2603.03790 • Published 21 days ago • 121
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants Paper • 2603.09652 • Published 15 days ago • 15
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports Paper • 2603.09896 • Published 14 days ago • 26
PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents Paper • 2603.08013 • Published 16 days ago • 14
Reasoning Models Struggle to Control their Chains of Thought Paper • 2603.05706 • Published 19 days ago • 34
RubricBench: Aligning Model-Generated Rubrics with Human Standards Paper • 2603.01562 • Published 23 days ago • 60
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL Paper • 2602.22190 • Published 27 days ago • 16
AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines Paper • 2602.14296 • Published Feb 15 • 51
AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration Paper • 2602.03786 • Published Feb 3 • 89
Weak-Driven Learning: How Weak Agents make Strong Agents Stronger Paper • 2602.08222 • Published Feb 9 • 283
GEBench: Benchmarking Image Generation Models as GUI Environments Paper • 2602.09007 • Published Feb 9 • 39