ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
Abstract
ResearchClawBench evaluates autonomous scientific research capabilities across 40 tasks from 10 domains using expert-curated criteria and reveals current limitations in re-discovery accuracy among AI agents and LLMs.
AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
Community
the most interesting part is how far even the best auto-research systems are from reliable re-discovery when the target paper sits behind a weighted, multimodal rubric. could you share a sensitivity analysis on the rubric weights—if you tilt the balance toward experimental protocol correctness versus evidence alignment, do the agent rankings shift, or are failures dominated by a couple of high-weight criteria? also curious about edge cases, like what happens when the associated raw data is noisy or the evaluated protocol diverges slightly from the original—does the system still retrieve the right artifacts? the arxivlens breakdown helped me parse the method details; btw the arxivlens summary covers section 3 in a way that's easy to skim: https://arxivlens.com/PaperView/Details/researchclawbench-a-benchmark-for-end-to-end-autonomous-scientific-research-222-c7698706
Thank you so much for your thoughtful questions and for looking into our work!
On sensitivity analysis of rubric weights: We analyzed the failure modes of most evaluated models. The dominant issue is not how the rubric criteria are weighted, but that agents often fail to understand and plan the scientific workflow the task instructions call for. As a result, their final reports miss the most critical information the task requires. In other words, low scores are driven primarily by drift in the core scientific conclusions, not by the distribution of rubric weights. Shifting weight between experimental protocol correctness and evidence alignment would therefore likely not change the rankings significantly, because most failures occur at a more fundamental level: the agents do not retrieve or reason over the right artifacts in the first place.
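For readers who want to run the weight-sensitivity check asked about above on their own results, here is a minimal sketch. It is not the benchmark's scoring code: the agent names, criterion names, per-criterion scores, and the `tilt` weighting scheme are all hypothetical, chosen only to show how a weighted rubric score and a ranking-stability count could be computed.

```python
from itertools import combinations

# Hypothetical per-criterion scores (0-100) for three made-up agents.
# The criterion names echo the failure categories in the abstract;
# every number here is invented for illustration.
SCORES = {
    "agent_a": {"protocol": 30.0, "evidence": 20.0, "core": 10.0},
    "agent_b": {"protocol": 20.0, "evidence": 25.0, "core": 5.0},
    "agent_c": {"protocol": 10.0, "evidence": 10.0, "core": 0.0},
}


def weighted_score(scores, weights):
    """Weighted mean of criterion scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total


def ranking(weights):
    """Agent names sorted from best to worst under the given weights."""
    return sorted(
        SCORES,
        key=lambda agent: weighted_score(SCORES[agent], weights),
        reverse=True,
    )


def tilt(protocol_share, core_weight=1.0):
    """Split a fixed budget of 2.0 between protocol and evidence (assumed scheme)."""
    return {
        "protocol": 2.0 * protocol_share,
        "evidence": 2.0 * (1.0 - protocol_share),
        "core": core_weight,
    }


def rank_flips(r1, r2):
    """Count agent pairs whose relative order differs between two rankings."""
    pos1 = {agent: i for i, agent in enumerate(r1)}
    pos2 = {agent: i for i, agent in enumerate(r2)}
    return sum(
        (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
        for a, b in combinations(r1, 2)
    )


baseline = ranking(tilt(0.5))
for share in (0.1, 0.3, 0.5, 0.7, 0.9):
    current = ranking(tilt(share))
    flips = rank_flips(baseline, current)
    print(f"protocol share {share:.1f}: {current} (flips vs baseline: {flips})")
```

With these invented numbers, the top two agents swap only at the most evidence-heavy tilt (protocol share 0.1); the toy data says nothing about how the real benchmark rankings behave. A flip count that stays at zero across the sweep would support the claim that rankings are insensitive to the protocol-versus-evidence balance.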
On noisy data and protocol deviations: Yes, noisy or imperfect data increases task difficulty, but that is by design: real-world scientific research is open-ended and rarely comes with clean, well-defined datasets. In our benchmark, the ResearchHarness baseline is equipped with tools such as web search, so agents can actively repair, supplement, or replace missing or noisy data, much as human scientists would. That said, this capability remains very challenging and is not yet reliably achieved. As for protocol deviations, we do not require agents to follow the exact methodology of the original paper. Instead, we adopt an outcome-focused evaluation that measures how well the AI’s core findings align with the target paper’s key conclusions. This matches real scientific practice: “All roads lead to Rome.”
Thank you again for your interest — your feedback is very valuable to us!
Get this paper in your agent:
hf papers read 2606.07591
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash


