CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan
2024-09-18
Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
Computation and Language, Artificial Intelligence, Multiagent Systems
What problem does this paper attempt to address?
The problem this paper addresses is **the difficulty of computational reproducibility in scientific research**. Specifically, the authors focus on how to evaluate and improve the ability of AI agents to reproduce the results of published scientific research.

### Background and Problems

1. **The Current State of Computational Reproducibility**:
   - Although many studies provide code and data, reproducing their results is still very difficult.
   - Studies in various fields (such as psychology, economics, and medicine) have shown that many results cannot be reproduced even when reproduction materials are available.
   - These difficulties may stem from issues such as unspecified software library versions, differences in machine architecture or operating system, and incompatibility between old libraries and new hardware.
2. **The Role of AI Agents**:
   - AI agents have the potential to assist users with a variety of consequential tasks, including conducting scientific research.
   - However, there is currently no benchmark that effectively evaluates how well AI agents perform computational reproducibility.

### Solution: CORE-Bench

To meet this challenge, the authors introduce **CORE-Bench (Computational Reproducibility Agent Benchmark)**, a benchmark designed specifically to evaluate the ability of AI agents to carry out computational reproducibility.

- **Task Design**:
  - CORE-Bench contains 270 tasks based on 90 papers from computer science, social science, and medicine.
  - The tasks span three difficulty levels and include both language-only and vision-language tasks.
- **Evaluation System**:
  - A fast and parallelizable evaluation system is provided, saving days of evaluation time per run compared to a sequential implementation.
  - Two baseline agents (AutoGPT and CORE-Agent) are evaluated, each with both GPT-4o and GPT-4o-mini as the underlying language model.
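
The task count and the parallel evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' harness: the summary gives only the counts (90 papers, three difficulty levels, 270 tasks) and says that evaluation is parallelizable, so the paper identifiers, the level names, the `toy_agent` stand-in, and the use of a thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical identifiers: the source gives only the counts (90 papers, 3 levels).
PAPERS = [f"paper_{i:02d}" for i in range(90)]
LEVELS = ("easy", "medium", "hard")


def build_tasks(papers, levels):
    """Pair every paper with every difficulty level."""
    return list(product(papers, levels))


def run_task(task, agent):
    """Run one task and report whether the agent reproduced the result."""
    paper, level = task
    return agent(paper, level)


def evaluate(tasks, agent, max_workers=8):
    """Evaluate independent tasks concurrently and return accuracy per level."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outcomes line up with tasks.
        outcomes = list(pool.map(lambda t: run_task(t, agent), tasks))
    accuracy = {}
    for level in sorted({lvl for _, lvl in tasks}):
        results = [ok for (_, lvl), ok in zip(tasks, outcomes) if lvl == level]
        accuracy[level] = sum(results) / len(results)
    return accuracy


def toy_agent(paper, level):
    """Stand-in agent (assumption): succeeds only on the 'easy' level."""
    return level == "easy"


tasks = build_tasks(PAPERS, LEVELS)
scores = evaluate(tasks, toy_agent)
```

Because each task is independent of the others, the runs can be distributed across workers without coordination, which is what makes this kind of evaluation faster than running tasks one after another. A real agent run would replace `toy_agent` with a call that executes the paper's code and compares the reported results.
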
### Results and Significance

- **Performance**:
  - The best agent achieved an accuracy of 21% on the hardest tasks, indicating that there is still much room for improvement.
- **Potential Impact**:
  - Successful agents can help researchers verify the reproducibility of their work, making it easier for independent researchers to replicate past studies.
  - Conference organizers and journal editors can more efficiently evaluate the reproducibility of submitted research.

Through CORE-Bench, the authors hope to improve the state of computational reproducibility and support the development of future research agents.