SciCode: A Research Coding Benchmark Curated by Scientists

Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
2024-07-18
Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode demonstrates both contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the difficulty of building evaluations for language models (LMs) that are challenging, high-quality, and realistic, especially in the scientific domain. As LMs surpass average human performance on many tasks, it becomes harder to design assessments that reflect what these models can and cannot do. The paper responds by creating SciCode, a benchmark that tests whether LMs can generate code to solve real scientific research problems. SciCode is curated by scientists and AI researchers working in 16 natural science subfields, covering areas such as mathematics, physics, chemistry, biology, and materials science. It contains 80 main problems, decomposed into 338 subproblems, each involving knowledge recall, reasoning, and code synthesis. It also provides optional descriptions of useful scientific background, along with scientist-annotated gold-standard solutions and test cases for evaluating model output. Its design principles include focusing on the natural sciences, ensuring high-quality data and annotations, selecting realistic and current problems, avoiding overlap with public datasets, testing models' comprehensive abilities, and supporting evaluation of different capabilities under different settings.
SciCode also takes steps to prevent data contamination and, where necessary, simplifies the problem setting by providing more background knowledge. The authors evaluate a range of state-of-the-art proprietary and open-source models and find that SciCode is extremely challenging: even the strongest model tested, Claude3.5-Sonnet, solves only 4.6% of the main problems under the most realistic evaluation setting. All models benefit from the scientist-annotated background knowledge, but even then the best result is only 12.3% of the main problems. The paper argues that SciCode shows the progress of current LMs toward becoming useful scientific assistants while exposing the challenges that remain in building and evaluating scientific AI. The authors expect that the benchmark's availability will inspire research on AI methods for accelerating scientific research, an area that has so far been underexplored due to a lack of commercial incentives, despite recent advances in LMs.
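To illustrate why a main-problem solve rate can be much lower than a subproblem pass rate, here is a minimal sketch of how per-subproblem test results might be aggregated. It assumes that a main problem counts as solved only when every one of its subproblems passes; the summary above does not state the scoring rule, so treat this as an assumption. The function name `solve_rates`, the tuple layout, and the toy data are hypothetical and are not taken from the SciCode codebase.

```python
from collections import defaultdict


def solve_rates(results):
    """Aggregate subproblem outcomes into two rates.

    `results` is an iterable of (main_id, sub_id, passed) tuples.
    Returns (subproblem_pass_rate, main_problem_solve_rate).
    ASSUMPTION: a main problem is solved only if all of its
    subproblems pass their test cases.
    """
    by_main = defaultdict(list)
    for main_id, _sub_id, passed in results:
        by_main[main_id].append(bool(passed))

    total_subs = sum(len(flags) for flags in by_main.values())
    if total_subs == 0:
        raise ValueError("no results to aggregate")

    passed_subs = sum(sum(flags) for flags in by_main.values())
    solved_main = sum(all(flags) for flags in by_main.values())
    return passed_subs / total_subs, solved_main / len(by_main)


# Toy data (invented for illustration): two main problems, five subproblems.
toy_results = [
    ("p1", "p1.1", True),
    ("p1", "p1.2", True),
    ("p2", "p2.1", True),
    ("p2", "p2.2", False),  # one failure leaves p2 unsolved
    ("p2", "p2.3", True),
]

sub_rate, main_rate = solve_rates(toy_results)
print(f"subproblem pass rate: {sub_rate:.2f}")
print(f"main problem solve rate: {main_rate:.2f}")
```

Under this assumed rule, a single failing subproblem leaves its whole main problem unsolved, so the main-problem rate falls well below the subproblem rate in the toy data.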