Fine-tuning Large Language Models for Chemical Text Mining

Mingyue Zheng, Wei Zhang, Qinggong Wang, Jiacheng Xiong, Shengkun Ni, Duanhua Cao, Buying Niu, Mingan Chen, Runze Zhang, Yitian Wang, Lehan Zhang, Xutong Li, Zhaoping Xiong, Qian Shi, Ziming Huang, Zunyun Fu, Xiangtai Kong

DOI: https://doi.org/10.26434/chemrxiv-2023-k7ct5-v2

2024-02-01

Abstract: Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered extremely challenging because of the complexity of chemical language and the scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided GPT-3.5 and GPT-4 with prompt engineering and fine-tuned GPT-3.5 as well as other open-source LLMs such as Llama2, T5, and BART. The results showed that the fine-tuned GPT models excelled in all tasks. They achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. They even outperformed task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.

Chemistry

What problem does this paper attempt to address?

The paper addresses five complex tasks in chemical text mining: compound entity recognition, reaction role labeling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance (NMR) data extraction, and the conversion of reaction paragraphs into action sequences. Its main objective is to test how well fine-tuned large language models (LLMs) perform on these tasks, and whether fine-tuning can reduce reliance on repetitive, time-consuming prompt engineering experiments. The paper compares GPT-3.5 and GPT-4 guided by prompt engineering against fine-tuned GPT-3.5 and fine-tuned open-source models such as Llama2, T5, and BART. The fine-tuned GPT models perform best across all five tasks, reaching exact accuracy of 69% to 95% with minimal annotated data and even surpassing specialized models that were pre-trained and fine-tuned on much larger amounts of in-domain data. This suggests that fine-tuned LLMs can improve the efficiency and accuracy of chemical knowledge extraction, which may in turn accelerate the discovery and creation of new substances. The study also highlights the versatility, robustness, and low-code capability of fine-tuned LLMs for complex knowledge extraction tasks.
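
To make the "exact accuracy" figure concrete, here is a minimal sketch of how an exact-match score over model outputs might be computed. The function names, the whitespace normalization, and the sample strings are assumptions for illustration only; this is not the paper's evaluation code.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (an assumed normalization)."""
    return " ".join(text.split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and annotations for an action-sequence task.
preds = ["ADD water; STIR 2 h", "FILTER;  DRY", "ADD ethanol"]
refs = ["ADD water; STIR 2 h", "FILTER; DRY", "ADD methanol"]

score = exact_match_accuracy(preds, refs)
print(f"exact-match accuracy: {score:.2f}")
```

Exact match is a strict metric: a prediction counts only if the entire structured output agrees with the annotation, so a single wrong reagent (as in the third pair) makes the whole sample a miss.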

Assessment of Fine-Tuned Large Language Models for Real-World Chemistry and Material Science Applications

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

Leveraging large language models for predictive chemistry

ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

Accelerated end-to-end chemical synthesis development with large language models

The Role of Model Architecture and Scale in Predicting Molecular Properties: Insights from Fine-Tuning RoBERTa, BART, and LLaMA

From Words to Molecules: A Survey of Large Language Models in Chemistry

BatGPT-Chem: A Foundation Large Model For Chemical Engineering

An Autonomous Large Language Model Agent for Chemical Literature Data Mining

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Are large language models superhuman chemists?

SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration

Structured Chemistry Reasoning with Large Language Models

ChemDFM: A Large Language Foundation Model for Chemistry

Augmenting large language models with chemistry tools

Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research