arxiv:2606.07591

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Published on May 28

Submitted by Wanghan Xu on Jun 8

#2 Paper of the day

Authors:

Wanghan Xu,

Haoxiang Yin,

Abstract

ResearchClawBench evaluates autonomous scientific research capabilities across 40 tasks from 10 domains using expert-curated criteria and reveals current limitations in re-discovery accuracy among AI agents and LLMs.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
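The abstract describes rubrics that decompose each target paper into weighted criteria. A minimal sketch of how such a rubric might be aggregated into one task score is shown below. The criterion names, the weights, the per-criterion scores in [0, 1], and the 0-100 scale are all illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str      # hypothetical criterion label
    weight: float  # relative importance assigned by the rubric (assumed)
    score: float   # judged score in [0, 1] (assumed scale)


def rubric_score(criteria):
    """Weighted mean of criterion scores, scaled to 0-100 (assumed scale)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return 100.0 * sum(c.weight * c.score for c in criteria) / total_weight


# Made-up rubric for a single task; none of these values come from the benchmark.
example = [
    Criterion("core finding reproduced", 0.5, 0.2),
    Criterion("experimental protocol matches", 0.3, 0.4),
    Criterion("figures support the claims", 0.2, 0.1),
]

print(round(rubric_score(example), 1))
```

Under this reading, a benchmark-level number would be the mean of such task scores over the 40 tasks, although the abstract does not spell out the exact aggregation.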

Community

black-yt

Paper author Paper submitter about 11 hours ago

We recently added more evaluation results for agents and LLMs to our homepage. You can click through to read the papers written by different agents along with their scores; it's very interesting.

avahal

about 3 hours ago

the most interesting part is how far even the best auto-research systems are from reliable re-discovery when the target paper sits behind a weighted, multimodal rubric. could you share a sensitivity analysis on the rubric weights—if you tilt the balance toward experimental protocol correctness versus evidence alignment, do the agent rankings shift, or are failures dominated by a couple of high-weight criteria? also curious about edge cases, like what happens when the associated raw data is noisy or the evaluated protocol diverges slightly from the original—does the system still retrieve the right artifacts? the arxivlens breakdown helped me parse the method details; btw the arxivlens summary covers section 3 in a way that's easy to skim: https://arxivlens.com/PaperView/Details/researchclawbench-a-benchmark-for-end-to-end-autonomous-scientific-research-222-c7698706

black-yt

Paper author Paper submitter about 3 hours ago

Thank you so much for your thoughtful questions and for looking into our work!

On sensitivity analysis of rubric weights: We have analyzed the failure modes of most evaluated models. The dominant issue is not the weighting of rubric criteria but that agents often fail to understand and plan the scientific workflow that the task instructions call for. As a result, their final reports miss the most critical information required by the task. In other words, the low scores are driven primarily by drift in the core scientific conclusions, not by how the rubric weights are distributed. Shifting weight between experimental protocol correctness and evidence alignment would therefore be unlikely to change the ranking significantly, because most failures occur at a more fundamental level: the agents simply do not retrieve or reason over the right artifacts in the first place.

On noisy data and protocol deviations: Yes, noisy or imperfect data certainly increases task difficulty, but that is by design, since real-world scientific research is inherently open-ended and rarely comes with clean, well-defined datasets. In our benchmark, the ResearchHarness baseline is equipped with tools such as web search, allowing agents to actively repair, supplement, or replace missing or noisy data, much as human scientists would. That said, this capability remains very challenging and is not yet reliably achieved. As for protocol deviations, we do not require agents to follow the exact methodology of the original paper. Instead, we adopt an outcome-focused evaluation that measures how well the AI's core findings align with the target paper's key conclusions. This matches real scientific practice: "All roads lead to Rome."
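For readers who want to try the kind of weight-sensitivity check asked about above, here is a small sketch. It is not the authors' analysis: the agent names, the criterion groups, the per-criterion scores, and the weights are all made up for illustration. It recomputes the ranking under a baseline weighting and under two tilted weightings, then compares the orderings.

```python
def weighted_score(scores, weights):
    """Weighted mean of per-criterion scores, scaled to 0-100 (assumed scale)."""
    total = sum(weights.values())
    return 100.0 * sum(weights[k] * scores[k] for k in weights) / total


def ranking(agents, weights):
    """Agent names ordered from highest to lowest weighted score."""
    return sorted(agents, key=lambda a: weighted_score(agents[a], weights), reverse=True)


# Hypothetical per-criterion scores in [0, 1]; not benchmark results.
agents = {
    "agent_a": {"protocol": 0.30, "evidence": 0.20, "core": 0.25},
    "agent_b": {"protocol": 0.15, "evidence": 0.25, "core": 0.20},
    "agent_c": {"protocol": 0.10, "evidence": 0.10, "core": 0.05},
}

# Hypothetical weightings: a baseline, then one tilt toward each criterion.
baseline = {"protocol": 1.0, "evidence": 1.0, "core": 2.0}
tilted_protocol = {"protocol": 3.0, "evidence": 1.0, "core": 2.0}
tilted_evidence = {"protocol": 1.0, "evidence": 3.0, "core": 2.0}

for label, w in [("baseline", baseline),
                 ("protocol-heavy", tilted_protocol),
                 ("evidence-heavy", tilted_evidence)]:
    print(label, ranking(agents, w))
```

With these made-up numbers the ordering is the same under all three weightings, because one agent is weak on every criterion and another leads on most of them. Whether the real benchmark behaves this way would have to be checked against the published per-criterion scores.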

Thank you again for your interest — your feedback is very valuable to us!