arXiv:2602.06855

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Published on Feb 6 · Submitted by Bhavul Gauri on Feb 10
Abstract

AI-generated summary: AIRS-Bench presents a comprehensive benchmark suite for evaluating LLM agents across diverse scientific domains, demonstrating current limitations while providing open-source resources for advancing autonomous scientific research.

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
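To make the described task format concrete, below is a minimal sketch of what an AIRS-Bench-style task record and human-SOTA check could look like. The class, field names, and example values are illustrative assumptions based only on the abstract, not the released task definitions or evaluation code.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AIRS-Bench-style task record. Per the abstract,
# each task is sourced from a recent ML paper, belongs to a domain, and is
# scored against the human SOTA; the field names here are assumptions,
# not the open-sourced task-definition format.
@dataclass
class AIRSTask:
    task_id: str          # e.g. "time_series_forecasting_01" (made up)
    domain: str           # "language modeling", "mathematics", ...
    description: str      # problem statement handed to the agent
    metric: str           # name of the scoring metric
    human_sota: float     # score reported in the source paper
    higher_is_better: bool = True

    def beats_human_sota(self, agent_score: float) -> bool:
        """True if an agent run exceeds the human reference score."""
        if self.higher_is_better:
            return agent_score > self.human_sota
        return agent_score < self.human_sota


# Usage: the paper reports agents beating human SOTA on 4 of 20 tasks.
task = AIRSTask(
    task_id="example_task",
    domain="time series forecasting",
    description="Forecast the next horizon given the training series.",
    metric="MSE",
    human_sota=0.285,        # placeholder value, not from the paper
    higher_is_better=False,  # lower error is better for MSE
)
print(task.beats_human_sota(0.271))  # True: agent error below human SOTA
```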

Community

Paper submitter

We are introducing AIRS-Bench, which asks agents to beat human SOTA on 20 research tasks drawn from recent ML papers (from NLP, math, and coding to biochemical modeling and time series prediction). We provide no baseline code and assess end-to-end research abilities, spanning idea generation, methodology, experiment analysis, and iterative refinement. Read the paper to see how well the agents did!
