HippoCamp: Benchmarking Contextual Agents on Personal Computers
Abstract
We present HippoCamp, a benchmark designed to evaluate agents' capabilities in multimodal file management. Unlike existing agent benchmarks that focus on web interaction, tool use, or software automation in generic settings, HippoCamp places agents in user-centric environments, requiring them to model individual user profiles and search massive personal file collections for context-aware reasoning. The benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K real-world files. Building on these raw files, we construct 581 QA pairs that assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we also provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling in particular with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. HippoCamp thus exposes critical limitations of current agents in realistic, user-centric environments and provides a foundation for developing next-generation personal AI assistants.
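The abstract's evaluation setup (QA pairs graded for accuracy, plus densely annotated trajectories used for step-wise failure diagnosis) can be sketched as a small harness. This is a hedged illustration only, not HippoCamp's actual data schema or API; the names `QAPair`, `Step`, `score_answers`, and `first_failure` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Step:
    # One densely annotated step in an agent trajectory:
    # the action taken, the evidence file it touched, and a
    # step-level correctness label enabling failure diagnosis.
    action: str
    evidence_path: str
    correct: bool

@dataclass
class QAPair:
    question: str
    gold_answer: str
    evidence_files: list  # files the agent must locate and ground on

def score_answers(pairs, predictions):
    """Exact-match accuracy over predicted answers (case/whitespace-insensitive)."""
    if not pairs:
        return 0.0
    hits = sum(
        1
        for pair, pred in zip(pairs, predictions)
        if pred.strip().lower() == pair.gold_answer.strip().lower()
    )
    return hits / len(pairs)

def first_failure(trajectory):
    """Step-wise diagnosis: index of the first incorrect step, or None."""
    for i, step in enumerate(trajectory):
        if not step.correct:
            return i
    return None
```

Under this sketch, answer-level accuracy and the earliest failing step (e.g. a perception error when opening the wrong evidence file) can be reported per task, matching the paper's split between outcome accuracy and step-wise bottleneck analysis.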