Abstract: Large language models (LLMs) have gained increasing prominence in scientific research, but there is a lack of comprehensive benchmarks to fully evaluate their proficiency in understanding and mastering scientific knowledge. To address this need, we introduce the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge: studying extensively, inquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application. Specifically, we first construct a large-scale evaluation dataset encompassing 70K multi-level scientific problems and solutions in the domains of biology, chemistry, physics, and materials science. By leveraging this dataset, we benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies. The results reveal that despite the state-of-the-art performance of proprietary LLMs, there is still significant room for improvement, particularly in addressing scientific reasoning and applications. We anticipate that SciKnowEval will establish a standard for benchmarking LLMs in science research and promote the development of stronger scientific LLMs. The dataset and code are publicly available at https://scimind.ai/sciknoweval.

What problem does this paper attempt to address?

The paper addresses the lack of comprehensive benchmarks for evaluating how well large language models (LLMs) understand and master scientific knowledge. Specifically, existing benchmarks have the following deficiencies:

1. **Lack of depth and breadth**: Many existing benchmarks only cover high-school-level scientific questions and cannot fully explore the potential of LLMs in deeper scientific tasks.
2. **Imperfect evaluation system**: Although some science-specific benchmarks involve more specialized scientific tasks, they lack a systematic evaluation framework, so they give only a limited picture of a model's capabilities.
3. **Neglect of safety and ethical issues**: Most benchmarks do not evaluate safety and ethical issues in scientific research.

To address these problems, the authors introduce SciKnowEval, a new framework that systematically evaluates the scientific knowledge of LLMs at five progressive levels:

- **Studying extensively (memory level)**: Evaluates the breadth of LLMs' knowledge across different scientific fields.
- **Inquiring earnestly (comprehension level)**: Examines the inquiry and exploration capabilities of LLMs in a scientific context.
- **Thinking profoundly (reasoning level)**: Tests the critical thinking, logical deduction, and problem-solving capabilities of LLMs.
- **Discerning clearly (discernment level)**: Evaluates the ability of LLMs to make correct, safe, and ethical decisions based on scientific knowledge.
- **Practicing assiduously (application level)**: Measures the ability of LLMs to apply scientific knowledge effectively in practical scenarios.

The authors construct a large-scale dataset of 70K multi-level scientific problems and solutions and benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies. The results show that although proprietary LLMs achieve state-of-the-art performance, there is still significant room for improvement, particularly in scientific reasoning and application. SciKnowEval is expected to become a standard tool for evaluating LLMs in the scientific field and to promote the development of stronger scientific LLMs.
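To make the evaluation setup more concrete, here is a minimal Python sketch of how zero-shot and few-shot prompts might be built and how accuracy could be reported per level. It is an illustration only, not the SciKnowEval code: the level codes `L1` to `L5`, the prompt template, the record fields, and the exact-match scoring rule are all assumptions introduced for this example.

```python
from collections import defaultdict

# Hypothetical level codes; the level names come from the abstract.
LEVELS = {
    "L1": "memory",
    "L2": "comprehension",
    "L3": "reasoning",
    "L4": "discernment",
    "L5": "application",
}


def build_prompt(question, exemplars=()):
    """Build a zero-shot prompt (no exemplars) or a few-shot prompt.

    `exemplars` is a sequence of (question, answer) pairs that are
    prepended as worked examples. The template is an assumption.
    """
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def accuracy_by_level(records):
    """Compute exact-match accuracy for each level present in `records`.

    Each record is a dict with the (assumed) keys 'level',
    'prediction', and 'answer'. Matching ignores case and
    surrounding whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        level = record["level"]
        total[level] += 1
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[level] += 1
    return {level: correct[level] / total[level] for level in total}
```

Reporting accuracy per level rather than as a single aggregate is what would expose the kind of gap the summary describes, where a model scores well on memory-level questions but poorly on reasoning or application. Exact match is only suitable for short, closed-form answers; open-ended tasks would need a different scoring rule.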

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models

SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge

M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models

MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

LawBench: Benchmarking Legal Knowledge of Large Language Models

LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models

ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models