Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

What problem does this paper attempt to address?

This paper addresses **the challenge of computational reproducibility in scientific research**. Specifically, the authors focus on how to evaluate and improve the ability of AI agents to reproduce the results of published studies from the code and data provided with them.

### Background and Problems

1. **The Current Situation of Computational Reproducibility**:
   - Although many studies provide code and data, reproducing their results remains very difficult.
   - Work in several fields (such as psychology, economics, and medicine) has shown that many studies cannot be reproduced even when reproduction materials are available.
   - These difficulties may stem from unspecified software library versions, differences in machine architecture or operating system, and incompatibilities between old libraries and new hardware.
2. **The Role of AI Agents**:
   - AI agents have the potential to assist users with a variety of consequential tasks, including conducting scientific research.
   - However, there is currently a lack of benchmarks that can effectively evaluate how well AI agents perform computational reproducibility.

### Solution: CORE-Bench

To address this gap, the authors introduce **CORE-Bench (Computational Reproducibility Agent Benchmark)**, a benchmark specifically designed to evaluate the ability of AI agents to reproduce published results.

- **Task Design**:
  - CORE-Bench contains 270 tasks based on 90 papers from computer science, social science, and medicine.
  - The tasks are divided into three difficulty levels and include both language-only and vision-language tasks.
- **Evaluation System**:
  - A fast and parallelizable evaluation system is provided, saving days of evaluation time per run compared to a sequential implementation.
  - Two baseline agents (the general-purpose AutoGPT and the task-specific CORE-Agent) are evaluated, each with both GPT-4o and GPT-4o-mini as the underlying language model.

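
To make the task structure and the idea of parallel evaluation concrete, here is a minimal sketch. It is not the authors' evaluation harness: the names `Task`, `build_tasks`, `evaluate`, and the `run_task` callback are hypothetical, as are the difficulty labels and the assumption that each paper yields one task per difficulty level (which is consistent with 90 papers and 270 tasks, but is not stated in the text above).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Task:
    paper_id: str    # hypothetical identifier for one paper
    difficulty: str  # hypothetical label, e.g. "easy", "medium", "hard"


def build_tasks(paper_ids: Iterable[str], levels: Iterable[str]) -> List[Task]:
    """Create one task per (paper, difficulty level) pair (an assumption)."""
    level_list = list(levels)
    return [Task(pid, lvl) for pid in paper_ids for lvl in level_list]


def evaluate(
    tasks: List[Task],
    run_task: Callable[[Task], bool],
    max_workers: int = 8,
) -> Dict[str, float]:
    """Run tasks concurrently and return accuracy per difficulty level.

    `run_task` stands in for "let the agent attempt the task and check
    its answer"; it should return True when the task was solved.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outcomes align with tasks.
        outcomes = list(pool.map(run_task, tasks))

    solved: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task, ok in zip(tasks, outcomes):
        total[task.difficulty] += 1
        solved[task.difficulty] += int(ok)
    return {lvl: solved[lvl] / total[lvl] for lvl in total}


if __name__ == "__main__":
    papers = [f"paper-{i}" for i in range(90)]
    tasks = build_tasks(papers, ["easy", "medium", "hard"])
    print(len(tasks))  # 270

    # Stub "agent" that only solves the easiest level, for illustration.
    print(evaluate(tasks, lambda t: t.difficulty == "easy"))
```

Because each task is independent of the others, the wall-clock time of a run is bounded by the slowest tasks rather than by the sum of all of them, which is the intuition behind the time savings claimed for a parallel evaluation. A real harness would more likely isolate each task in its own machine or container than in a thread.
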
### Results and Significance

- **Performance**:
  - The best agent achieved an accuracy of 21% on the hardest tasks, indicating that there is still much room for improvement.
- **Potential Impact**:
  - Successful agents can help researchers verify the reproducibility of their own work and make it easier for independent researchers to reproduce past studies.
  - Conference organizers and journal editors can more efficiently evaluate the reproducibility of submitted research.

Through CORE-Bench, the authors hope to improve the state of computational reproducibility and support the development of future research agents.

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

AI Agents That Matter

ML Research Benchmark

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Codabench: Flexible, Easy-to-use, and Reproducible Meta-Benchmark Platform

WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

WebArena: A Realistic Web Environment for Building Autonomous Agents

BenchBot: Evaluating Robotics Research in Photorealistic 3D Simulation and on Real Robots

The ACRV Picking Benchmark (APB): A Robotic Shelf Picking Benchmark to Foster Reproducible Research

BLADE: Benchmarking Language Model Agents for Data-Driven Science

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

SAIBench: Benchmarking AI for Science