Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.

What problem does this paper attempt to address?

The paper addresses the difficulty of evaluating language models (LMs) with benchmarks that are sufficiently challenging, high-quality, and realistic, especially in the scientific domain. As LMs surpass average human performance on many tasks, it becomes harder to design assessments that reflect their actual capabilities and limitations. The paper responds by building SciCode, a benchmark that tests whether LMs can generate code to solve real scientific research problems. SciCode is curated by scientists and AI researchers from 16 natural science subfields, covering areas such as mathematics, physics, chemistry, biology, and materials science. It contains 80 main problems decomposed into 338 subproblems, each involving knowledge recall, reasoning, and code synthesis. It also provides optional descriptions of useful scientific background, along with scientist-annotated gold-standard solutions and test cases for evaluation. Its design principles include focusing on the natural sciences, ensuring high-quality data and annotations, selecting realistic and current problems, avoiding overlap with public datasets, testing models' comprehensive abilities, and supporting evaluation of different capabilities under different settings.
SciCode also takes steps to prevent data contamination and, where necessary, simplifies the problem setting by providing more background knowledge. Evaluating a range of state-of-the-art proprietary and open-source models shows that the benchmark is extremely challenging: the strongest model tested, Claude3.5-Sonnet, solves only 4.6% of the main problems under the most realistic evaluation setting. All models benefit from the scientist-annotated background knowledge, but even then the best model solves only 12.3% of the main problems. The paper presents these results as a measure of current LMs' progress toward becoming useful scientific assistants and as guidance for building and evaluating scientific AI. The authors believe that SciCode can inspire research on AI methods for accelerating scientific research, an area that has so far been under-explored for lack of commercial incentives despite recent advances in LMs, and that a carefully designed benchmark of this kind can promote further improvement of LMs in the scientific domain.
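The benchmark structure summarized above (main problems decomposed into subproblems, each with test cases and optional background) can be illustrated with a minimal Python sketch. This is not SciCode's actual data format or evaluation harness: the class names, the `solve_rates` helper, and the toy data are all hypothetical, and the rule that a main problem counts as solved only when every one of its subproblems passes is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Subproblem:
    description: str
    tests: List[Callable[[], bool]]  # each test returns True on success
    background: str = ""             # optional scientific background text

    def prompt(self, with_background: bool) -> str:
        # Build the prompt with or without the optional background.
        if with_background and self.background:
            return f"{self.background}\n\n{self.description}"
        return self.description

    def passed(self) -> bool:
        return all(test() for test in self.tests)


@dataclass
class MainProblem:
    name: str
    subproblems: List[Subproblem] = field(default_factory=list)

    def solved(self) -> bool:
        # Assumption: solved only if every subproblem passes its tests.
        return all(sp.passed() for sp in self.subproblems)


def solve_rates(problems: List[MainProblem]) -> Tuple[float, float]:
    """Return (main-problem solve rate, subproblem solve rate)."""
    subs = [sp for p in problems for sp in p.subproblems]
    main_rate = sum(p.solved() for p in problems) / len(problems)
    sub_rate = sum(sp.passed() for sp in subs) / len(subs)
    return main_rate, sub_rate


# Toy data only: stand-ins for model outputs checked against test cases.
ok = lambda: True
bad = lambda: False
problems = [
    MainProblem("toy_A", [Subproblem("step 1", [ok]), Subproblem("step 2", [ok])]),
    MainProblem("toy_B", [Subproblem("step 1", [ok], background="hint"),
                          Subproblem("step 2", [bad])]),
]
main_rate, sub_rate = solve_rates(problems)
```

Under this all-subproblems rule the main-problem rate can be much lower than the subproblem rate, since a single failed step fails the whole problem; in the toy data one of two main problems is solved while three of four subproblems pass.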

SciCode: A Research Coding Benchmark Curated by Scientists
