SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang
2024-06-28
Abstract: Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to assess and analyze the abilities of Large Language Models (LLMs) in solving complex scientific problems. Most existing LLM benchmarks focus on high school-level math problems or text-based questions and concentrate on basic arithmetic operations, which is not enough to evaluate how well LLMs solve scientific problems that require in-depth reasoning. The paper introduces a new benchmark suite, SciBench, that targets college-level scientific problems in mathematics, chemistry, and physics, with the goal of thoroughly assessing LLMs' scientific problem-solving capabilities. The features of SciBench include:

1. **Broad range of problems**: It contains 869 questions collected from widely used college textbooks. They require multi-step reasoning, understanding of scientific concepts, retrieval of specialized knowledge, and complex numerical computation.
2. **Incorporates visual elements**: 177 questions combine visual elements such as graphics and charts and are used to evaluate multimodal LLMs.
3. **Detailed solutions**: Step-by-step solutions are provided for example problems, which makes detailed error analysis possible.
4. **Real-world scenario simulation**: It includes an independent closed dataset drawn from actual midterm and final exam questions in computer science and mathematics courses, so that the assessment reflects authentic exam conditions.

Using SciBench, the paper evaluates a range of open-source and proprietary LLMs under various prompting strategies, and allows models to use external tools, including scientific computing libraries in Python and the Wolfram Language. The results show that even advanced LLMs have significant performance gaps on complex scientific problems: the highest average score is only 43.22%, which indicates substantial room for improvement.
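As an illustration of how an average score such as the one above could be computed, here is a minimal Python sketch that marks a numeric answer correct when it falls within a relative tolerance of the reference answer and reports the percentage of correct answers. The 5% tolerance, the function names `is_correct` and `accuracy`, and the treatment of a missing answer as wrong are assumptions made for this example and are not taken from the summary above.

```python
from typing import Optional, Sequence


def is_correct(predicted: Optional[float], reference: float,
               rel_tol: float = 0.05, abs_tol: float = 1e-9) -> bool:
    """Return True when `predicted` lies within `rel_tol` of `reference`.

    The 5% default is an assumption for this sketch, not a documented setting.
    """
    if predicted is None:
        # The model produced no parsable number: count it as wrong.
        return False
    if reference == 0:
        # A relative tolerance is meaningless at zero, so fall back to an absolute one.
        return abs(predicted) <= abs_tol
    return abs(predicted - reference) <= rel_tol * abs(reference)


def accuracy(predictions: Sequence[Optional[float]],
             references: Sequence[float],
             rel_tol: float = 0.05) -> float:
    """Percentage of problems answered correctly under the tolerance rule."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(is_correct(p, r, rel_tol) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    # Hypothetical model outputs and reference answers for four problems.
    model_answers = [1.0, 2.2, None, 0.0]
    reference_answers = [1.04, 2.0, 3.0, 0.0]
    print(accuracy(model_answers, reference_answers))
```

In this toy run, the first and last answers fall within tolerance, while the second is off by 10% and the third is missing, so two of four problems count as correct.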
Furthermore, the paper proposes a self-improvement method for identifying the skills that LLMs lack when solving problems. By comparing the answers generated by the models with the correct answers, the method summarizes ten key scientific problem-solving skills and automatically categorizes which of these skills are lacking under different experimental configurations. These findings can guide the design and optimization of future LLMs to improve their performance in scientific problem-solving.
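The per-configuration categorization described above can be pictured as a simple tally. The sketch below assumes each incorrect solution has already been labeled with the one skill it lacks, and it computes, for each experimental configuration, the share of errors attributed to each skill. The function name `skill_deficiency_profile`, the configuration names, and the skill labels are hypothetical placeholders, not the paper's actual categories or pipeline.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def skill_deficiency_profile(
    error_records: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Map each configuration to the fraction of its errors per lacking skill.

    Each record is a (configuration, lacking_skill) pair for one wrong answer.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for config, skill in error_records:
        counts[config][skill] += 1

    profile: Dict[str, Dict[str, float]] = {}
    for config, counter in counts.items():
        total = sum(counter.values())
        profile[config] = {skill: n / total for skill, n in counter.items()}
    return profile


if __name__ == "__main__":
    # Hypothetical labeled errors; names are placeholders for illustration only.
    records = [
        ("zero-shot", "calculation"),
        ("zero-shot", "calculation"),
        ("zero-shot", "logical reasoning"),
        ("few-shot + Python", "code conversion"),
    ]
    for config, shares in skill_deficiency_profile(records).items():
        print(config, shares)
```

Comparing such profiles across configurations is one way to see whether a prompting strategy that reduces one kind of error increases another.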