Abstract: Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.

What problem does this paper attempt to address?

The paper addresses how to assess the capabilities, and analyze the skills, of Large Language Models (LLMs) in solving complex scientific problems. Most existing LLM benchmarks focus on high-school-level math problems or text-based questions and mainly exercise basic arithmetic operations, which is not enough to evaluate how well LLMs solve scientific problems that require in-depth reasoning. The paper introduces a new benchmark suite, SciBench, that targets college-level scientific problems in mathematics, chemistry, and physics, with the goal of thoroughly assessing LLMs' scientific problem-solving capabilities. The features of SciBench include:

1. **Broad range of problems**: It contains 869 questions collected from widely used college textbooks, involving multi-step reasoning, understanding of scientific concepts, retrieval of specialized knowledge, and complex numerical computation.
2. **Incorporates visual elements**: 177 questions combine visual elements such as graphics and charts and are used to evaluate multimodal LLMs.
3. **Detailed solutions**: Step-by-step solutions are provided for example problems, facilitating detailed error analysis.
4. **Real-world scenario simulation**: It includes an independent closed dataset sourced from actual midterm and final exam questions from computer science and mathematics courses, ensuring the authenticity and completeness of the assessment.

Using SciBench, the paper evaluates a range of open-source and proprietary LLMs under various prompting strategies, including settings that allow models to call external scientific computing tools in Python and Wolfram Language. The results show that even advanced LLMs have significant performance gaps on complex scientific problems: the highest average score is only 43.22%, indicating substantial room for improvement in this field.
Furthermore, the paper proposes a self-improvement method to identify skill deficiencies in LLMs when solving problems. By comparing the answers generated by the models with the correct answers, the method summarizes ten key scientific problem-solving skills and automatically categorizes which skills the LLMs lack under different experimental configurations. These findings help guide the design and optimization of future LLMs to enhance their performance in scientific problem-solving.
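
As a rough illustration of the two measurements described above (an accuracy score and a tally of missing skills), the following Python sketch scores numeric answers and counts error labels. It is not the paper's evaluation code: the 5% relative tolerance, the function names `is_correct`, `accuracy`, and `tally_missing_skills`, and the skill labels in the usage example are all assumptions made for illustration.

```python
from collections import Counter


def is_correct(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Return True if `predicted` is within `rel_tol` of `reference`.

    The 5% default is an assumed tolerance, not a value taken from the summary.
    When the reference is 0, only an exact match passes.
    """
    return abs(predicted - reference) <= rel_tol * abs(reference)


def accuracy(records: list[tuple[float, float]]) -> float:
    """Percentage of (predicted, reference) pairs judged correct."""
    if not records:
        return 0.0
    hits = sum(is_correct(pred, ref) for pred, ref in records)
    return 100.0 * hits / len(records)


def tally_missing_skills(error_labels: list[str]) -> list[tuple[str, int]]:
    """Count how often each (hypothetical) skill label was assigned to a wrong answer."""
    return Counter(error_labels).most_common()


if __name__ == "__main__":
    # Made-up (predicted, reference) pairs for one model and one prompting strategy.
    records = [(1.02, 1.0), (2.5, 2.0), (9.6, 10.0), (0.0, 3.0)]
    print(f"accuracy: {accuracy(records):.2f}%")

    # Placeholder labels; the ten skills identified in the paper are not listed in this summary.
    labels = ["calculation", "calculation", "concept"]
    print(tally_missing_skills(labels))
```

Running the same tally for each prompting strategy would make it possible to see whether a strategy that reduces one kind of error increases another, which is the trade-off the abstract reports.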

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

NLPBench: Evaluating Large Language Models on Solving NLP Problems

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Benchmarking Large Language Models for Math Reasoning Tasks

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

SciAgent: Tool-augmented Language Models for Scientific Reasoning

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis