SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, Huajun Chen
2024-10-08
Abstract: Large language models (LLMs) have gained increasing prominence in scientific research, but there is a lack of comprehensive benchmarks to fully evaluate their proficiency in understanding and mastering scientific knowledge. To address this need, we introduce the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge: studying extensively, inquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application. Specifically, we first construct a large-scale evaluation dataset encompassing 70K multi-level scientific problems and solutions in the domains of biology, chemistry, physics, and materials science. By leveraging this dataset, we benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies. The results reveal that despite the state-of-the-art performance of proprietary LLMs, there is still significant room for improvement, particularly in addressing scientific reasoning and applications. We anticipate that SciKnowEval will establish a standard for benchmarking LLMs in science research and promote the development of stronger scientific LLMs. The dataset and code are publicly available at https://scimind.ai/sciknoweval.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of comprehensive benchmarks for evaluating how well large language models (LLMs) understand and master scientific knowledge. Existing benchmarks have three main deficiencies:

1. **Lack of depth and breadth**: Many existing benchmarks cover only high-school-level scientific questions and cannot probe the potential of LLMs on deeper scientific tasks.
2. **Imperfect evaluation system**: Some science-specific benchmarks include more specialized tasks, but they lack a systematic evaluation framework, which limits what can be learned about a model's capabilities.
3. **Neglect of safety and ethical issues**: Most benchmarks do not evaluate safety and ethical issues in scientific research.

To address these problems, the authors introduce SciKnowEval, a framework that systematically evaluates the scientific knowledge of LLMs at five progressive levels:

- **Studying extensively (memory)**: Evaluate the breadth of LLMs' knowledge across scientific fields.
- **Inquiring earnestly (comprehension)**: Examine the questioning and exploration capabilities of LLMs in a scientific context.
- **Thinking profoundly (reasoning)**: Test the critical thinking, logical deduction, and problem-solving capabilities of LLMs.
- **Discerning clearly (discernment)**: Evaluate the ability of LLMs to make correct, safe, and ethical decisions based on scientific knowledge.
- **Practicing assiduously (application)**: Measure the ability of LLMs to apply scientific knowledge effectively in real scenarios.

The authors construct a large-scale dataset of about 70,000 multi-level scientific questions and solutions and use zero-shot and few-shot prompting strategies to benchmark 26 advanced open-source and proprietary LLMs. The study finds that although proprietary LLMs achieve state-of-the-art performance, there is still significant room for improvement, particularly in scientific reasoning and application. SciKnowEval is expected to become a standard tool for evaluating LLMs in the scientific field and to promote the development of stronger scientific LLMs.
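The zero-shot and few-shot evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the multiple-choice prompt layout, the function names `format_item`, `build_prompt`, and `accuracy_by_level`, and the record format are all assumptions made for illustration. The level labels simply reuse the five abilities named in the abstract.

```python
from collections import defaultdict

# Level labels taken from the abilities named in the abstract; using them as
# dictionary keys is an assumption of this sketch.
LEVELS = ["memory", "comprehension", "reasoning", "discernment", "application"]


def format_item(question, choices):
    """Render one question in a hypothetical multiple-choice layout."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(question, choices, examples=()):
    """Zero-shot when `examples` is empty; few-shot when solved examples are given."""
    parts = [
        format_item(ex["question"], ex["choices"]) + " " + ex["answer"]
        for ex in examples
    ]
    parts.append(format_item(question, choices))
    return "\n\n".join(parts)


def accuracy_by_level(records):
    """Aggregate accuracy per level from (level, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, predicted, gold in records:
        total[level] += 1
        correct[level] += int(predicted == gold)
    return {level: correct[level] / total[level] for level in total}
```

Reporting accuracy per level rather than as one overall number is what would let a reader see where a model is weakest, for example on reasoning or application questions.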