GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? Paper • 2606.17861 • Published 12 days ago • 58
PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions Paper • 2606.14832 • Published 16 days ago • 12
PhoneWorld: Scaling Phone-Use Agent Environments Paper • 2605.29486 • Published about 1 month ago • 11
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents Paper • 2605.07630 • Published May 8 • 1
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows Paper • 2604.28139 • Published Apr 30 • 42
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning Paper • 2604.16029 • Published Apr 17 • 23
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models Paper • 2604.10866 • Published Apr 13 • 68