Abstract: There is growing interest in the role that LLMs play in chemistry, which has led to an increased focus on developing LLM benchmarks tailored to chemical domains to assess the performance of LLMs across a spectrum of chemical tasks varying in type and complexity. However, existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. To this end, we propose ChemEval, which provides a comprehensive assessment of the capabilities of LLMs across a wide range of chemical domain tasks. Specifically, ChemEval identifies 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks. The tasks are informed by open-source data and by data meticulously crafted by chemical experts, ensuring that they have practical value and can effectively evaluate the capabilities of LLMs. In the experiments, we evaluate 12 mainstream LLMs on ChemEval under zero-shot and few-shot learning contexts, with carefully selected demonstration examples and carefully designed prompts. The results show that while general LLMs like GPT-4 and Claude-3.5 excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge. Conversely, specialized LLMs exhibit enhanced chemical competencies, albeit with reduced literature comprehension. This suggests that LLMs have significant potential for enhancement when tackling sophisticated tasks in the field of chemistry. We believe our work will facilitate the exploration of their potential to drive progress in chemistry. Our benchmark and analysis will be available at https://github.com/USTC-StarTeam/ChemEval.

What problem does this paper attempt to address?

The paper addresses the problem of evaluating large language models (LLMs) in chemistry. Specifically, existing evaluation benchmarks fail to adequately meet the needs of professional chemistry researchers. To solve this problem, the researchers propose a new benchmark framework called ChemEval. ChemEval aims to comprehensively assess the ability of LLMs to handle chemical tasks, spanning chemical knowledge from basic concepts to advanced topics. It evaluates model performance across 4 levels, 12 dimensions, and 42 specific chemical tasks. These tasks cover various aspects of chemistry, from basic knowledge Q&A to literature comprehension, molecular understanding, and scientific knowledge reasoning. The experiments evaluate 12 mainstream models, including general LLMs and specialized chemical LLMs, in zero-shot and few-shot settings. The results show that general LLMs perform well in literature comprehension and instruction following but poorly on tasks requiring in-depth chemical knowledge; in contrast, specialized chemical LLMs show stronger chemical capabilities, although their literature comprehension declines. This indicates that LLMs still have considerable room for improvement in chemistry. The researchers hope that this work will promote the application and development of LLMs in chemical research.
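The difference between the zero-shot and few-shot settings mentioned above can be illustrated with a minimal sketch. The prompt layout, the `Example` type, and the `build_prompt` helper are assumptions made for illustration; they are not ChemEval's actual prompts or code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    """A hypothetical demonstration pair used only in the few-shot setting."""
    question: str
    answer: str


def build_prompt(instruction: str, question: str,
                 demos: Sequence[Example] = ()) -> str:
    """Assemble a prompt; with no demos this is zero-shot, otherwise few-shot.

    The "Question:/Answer:" layout is an assumed format, not the benchmark's.
    """
    parts = [instruction.strip()]
    # Few-shot: prepend solved demonstrations before the real question.
    for demo in demos:
        parts.append(f"Question: {demo.question}\nAnswer: {demo.answer}")
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


instruction = "Answer the chemistry question concisely."
zero_shot = build_prompt(instruction, "What is the molecular formula of water?")
few_shot = build_prompt(
    instruction,
    "What is the molecular formula of water?",
    demos=[Example("What is the molecular formula of methane?", "CH4")],
)
```

In this sketch the only change between the two settings is the presence of demonstration pairs, so any difference in model output can be attributed to the demonstrations rather than to the instruction text.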

ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models
