Extracting Structured Data from Organic Synthesis Procedures Using a Fine-Tuned Large Language Model

Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie, Connor W. Coley

DOI: https://doi.org/10.26434/chemrxiv-2024-979fz

2024-04-08

Abstract: The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and due to the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we leverage the power of fine-tuned large language models (LLMs) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.

Chemistry

What problem does this paper attempt to address?

The paper addresses the extraction of structured data from organic synthesis procedure text. Specifically, the researchers fine-tune a large language model (LLM) to extract reaction information from procedure text and convert it into structured data following the Open Reaction Database (ORD) schema. Because most data in the chemical literature (papers, patents) exists as unstructured text, converting it to structured data by hand is slow and does not scale. An automated tool for this task would make historical reaction data easier to use and would support data-driven research.

The paper pursues this goal in three steps:

1. **Data Preparation**: A large amount of reaction data was obtained from the United States Patent and Trademark Office (USPTO), and a specific data processing workflow was adopted to ensure the quality and consistency of the data.
2. **Model Fine-tuning**: An open-source large language model, LLaMA-2-7B, was selected and fine-tuned with parameter-efficient techniques to generate structured data that meets the ORD format requirements.
3. **Evaluation and Analysis**: The model's performance was measured with quantitative metrics. According to the abstract, the model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., compound, workup, or condition definitions) and 92.25% for individual data fields.

Overall, the study shows that a fine-tuned large language model can extract key information from unstructured organic synthesis text and convert it into structured data for use in subsequent chemical research.
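To make the evaluation idea concrete, the following is a minimal Python sketch of a field-level comparison between a reference record and a model-generated record. It is not the paper's evaluation code: the helper names (`flatten`, `field_accuracy`), the exact-match scoring rule, and the simplified record layout are all assumptions made for illustration, and the example record only loosely resembles an ORD entry rather than following the real schema.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts/lists into a {path: leaf_value} mapping."""
    leaves = {}
    if isinstance(record, dict):
        for key, value in record.items():
            leaves.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(record, list):
        for index, value in enumerate(record):
            leaves.update(flatten(value, f"{prefix}/{index}"))
    else:
        leaves[prefix] = record
    return leaves


def field_accuracy(reference, predicted_text):
    """Return (is_valid_json, fraction of reference leaf fields matched exactly).

    Hypothetical metric for illustration: a field counts as correct only if
    the prediction has the same path with an identical value.
    """
    try:
        predicted = json.loads(predicted_text)
    except json.JSONDecodeError:
        # Output that does not parse is treated as syntactically invalid.
        return False, 0.0
    ref_fields = flatten(reference)
    pred_fields = flatten(predicted)
    if not ref_fields:
        return True, 1.0
    matches = sum(
        1
        for path, value in ref_fields.items()
        if path in pred_fields and pred_fields[path] == value
    )
    return True, matches / len(ref_fields)


# Simplified, made-up record (not the actual ORD schema).
reference = {
    "inputs": [
        {"name": "benzaldehyde", "role": "REACTANT",
         "mass": {"value": 1.06, "units": "GRAM"}},
        {"name": "ethanol", "role": "SOLVENT",
         "volume": {"value": 10, "units": "MILLILITER"}},
    ],
    "conditions": {"temperature": {"value": 25, "units": "CELSIUS"}},
}

# Simulated model output with one wrong reaction role.
predicted = json.loads(json.dumps(reference))
predicted["inputs"][1]["role"] = "REACTANT"

valid, accuracy = field_accuracy(reference, json.dumps(predicted))
print(valid, accuracy)
```

Exact matching on flattened paths is a deliberately strict choice: it penalizes reordered list entries and equivalent but differently written values, so a real evaluation would likely need a more tolerant alignment between predicted and reference compounds.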

SynAsk: Unleashing the Power of Large Language Models in Organic Synthesis

An Autonomous Large Language Model Agent for Chemical Literature Data Mining

Bridging Chemical Knowledge and Machine Learning for Performance Prediction of Organic Synthesis

Large Language Models for Inorganic Synthesis Predictions

Automated Chemical Reaction Extraction from Scientific Literature

Machine Learning Prediction of Structure‐Performance Relationship in Organic Synthesis

Structured information extraction from complex scientific text with fine-tuned large language models

Structured information extraction from scientific text with large language models

Fine-tuning Large Language Models for Chemical Text Mining

Structured Chemistry Reasoning with Large Language Models

Inferring experimental procedures from text-based representations of chemical reactions

OpenChemIE: An Information Extraction Toolkit For Chemistry Literature

Prediction of Organic Reaction Outcomes Using Machine Learning

Leveraging Reaction-aware Substructures for Retrosynthesis Analysis

State of the Art and Outlook of Data Science and Machine Learning in Organic Chemistry

Validation of the Scientific Literature via Chemputation Augmented by Large Language Models

Automated electrosynthesis reaction mining with multimodal large language models (MLLMs)

An Automatic End-to-end Chemical Synthesis Development Platform Powered by Large Language Models