arxiv:2605.26340

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Published on May 25 · Submitted by Rui on May 28

Abstract

AI-generated summary: Autonomous research agents suffer from verifiability failures such as fabricated citations and unreproducible results. The paper addresses them with a framework that makes every claim traceable to its evidence source, and with an end-to-end system that maintains this integrity throughout the research process.

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
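To make the traceability idea concrete, here is a minimal sketch of a claim-to-evidence ledger with a simple audit pass. It is an illustration under stated assumptions, not the paper's implementation: the names `Evidence`, `Claim`, `untraceable_claims`, and `pass_rate`, the locator strings, and the rule "a claim fails if any of its evidence cannot be resolved" are all hypothetical. The abstract specifies only that every claim must trace to an evidence source and that the audit reports pass counts such as 14/15.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvidenceKind(Enum):
    CODE = "code"
    DATA = "data"
    LITERATURE = "literature"


@dataclass(frozen=True)
class Evidence:
    kind: EvidenceKind
    locator: str  # hypothetical: a file path, run-log key, or citation key


@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def untraceable_claims(claims, known_locators):
    """Return claims that cite no evidence or cite a locator that cannot be resolved."""
    flagged = []
    for claim in claims:
        unresolved = [e for e in claim.evidence if e.locator not in known_locators]
        if not claim.evidence or unresolved:
            flagged.append(claim)
    return flagged


def pass_rate(passed, total):
    """Fraction of audited items that passed a check."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


# Toy ledger; every claim and locator here is made up for illustration.
claims = [
    Claim("Accuracy improves over the baseline.",
          [Evidence(EvidenceKind.DATA, "runs/eval.json#accuracy")]),
    Claim("Prior work reports a similar trend.",
          [Evidence(EvidenceKind.LITERATURE, "unverified-citation-key")]),
    Claim("The approach is robust to noise."),
]
known_locators = {"runs/eval.json#accuracy"}

flagged = untraceable_claims(claims, known_locators)
for claim in flagged:
    print("untraceable:", claim.text)

# The abstract's method-code alignment count for ScientistOne, as a rate.
print(f"{pass_rate(14, 15):.1%}")
```

In this sketch the second claim is flagged because its citation key does not resolve, and the third because it carries no evidence at all. These correspond loosely to the fabricated-citation and unsupported-claim failure modes the audit targets.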

Community

Paper submitter

We audited 75 AI-generated research papers and found that every baseline exhibits at least one systematic integrity failure: unsupported claims, fabricated citations, or method-code misalignment.

ScientistOne introduces Chain-of-Evidence: every claim traces to code, data, or literature. Among the five systems audited, it is the only one to achieve zero hallucinated references, perfect score verification, and the highest method-code alignment, while matching or exceeding human expert performance on frontier research benchmarks.

