Abstract: The extraction of chemical reactions from U.S. Patent and Trademark Office (USPTO) documents has enabled significant advancements in machine learning models for organic synthesis. While the USPTO dataset offers a large and diverse collection of reaction data, recent studies have identified issues such as inconsistent or missing chemical entries, impacting data quality. To address these challenges, we employed fine-tuned large language models (LLMs) to revisit experimental sections in the US patents, performing a comprehensive analysis of noisy reaction data. Our findings demonstrate that LLMs produce fewer false reactions compared to existing datasets and reveal that many reactions in US patents involve multiple experimental steps, previously overlooked by standard extraction methods. Our analysis suggests that untraceable references and erroneous chemical names are primary sources of data noise. We also identify reaction types with high susceptibility to these issues, recommending scientists avoid using those high-risk reaction data.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses noise and data-quality problems in chemical reaction data extracted from United States Patent and Trademark Office (USPTO) documents. Specifically, the researchers focus on the following problems:

1. **Inconsistent and missing data**: Chemical entries in the USPTO dataset are sometimes inconsistent or missing, which degrades data quality.
2. **Overlooked multi-step reactions**: Many reactions in patents involve multiple experimental steps, which existing extraction methods usually overlook.
3. **Untraceable references and incorrect chemical names**: References that cannot be traced and erroneous chemical names are the main sources of data noise.
4. **High-risk reaction types**: Some reaction types are more susceptible to the problems above, and scientists are advised to avoid using these high-risk data.

To address these problems, the researchers used fine-tuned large language models (LLMs) to re-examine the experimental sections of US patents, carried out a comprehensive analysis of the noisy reaction data, and proposed improved extraction methods. With this approach they aim to produce higher-quality reaction data and fewer false-positive reactions.

### Specific problem description

- **Data noise**: The existing dataset contains a large amount of noisy data, such as duplicated reactions, missing reagent information, and rare templates.
- **Multi-step reactions**: Many reactions involve multiple experimental steps, which existing extraction methods do not fully account for.
- **Chemical name recognition errors**: Some chemical names have no corresponding SMILES entry in the database, which adds to the data noise.

### Solutions

The researchers proposed a new method based on large language models (LLMs) to improve the accuracy and completeness of reaction data. Specific measures include:

1. **Re-parsing experimental paragraphs**: Use LLMs to re-parse the experimental paragraphs in US patents so that all relevant experimental steps are captured.
2. **Multi-step reaction representation**: Represent reaction data as a reaction sequence rather than a single string, to better reflect multi-step reactions.
3. **Fine-grained annotation and filtering**: Summarize the experimental steps in a structured JSON format and filter out workup steps unrelated to the reaction.
4. **Atom-mapping tool**: Use recent atom-mapping tools (such as LocalMapper) to ensure that reactants and reagents are defined accurately.

Through these improvements, the researchers hope to significantly improve the quality of the chemical reaction data extracted from US patents, thereby providing more reliable data for machine learning models in organic synthesis.
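The step-level ideas in the solutions (an ordered sequence of steps summarized as JSON, filtering of steps unrelated to the reaction, and detection of names with no SMILES match) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the JSON schema, the `WORKUP_ACTIONS` labels, the helper names, and the toy `TOY_LOOKUP` table are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical labels for workup actions; the paper's actual schema is not shown here.
WORKUP_ACTIONS = {"filter", "wash", "dry", "concentrate", "purify"}


@dataclass
class Step:
    action: str
    chemicals: List[str] = field(default_factory=list)


def parse_steps(raw_json: str) -> List[Step]:
    """Parse a JSON array of step objects into an ordered list of Step records."""
    return [Step(d["action"], d.get("chemicals", [])) for d in json.loads(raw_json)]


def drop_workup(steps: List[Step]) -> List[Step]:
    """Keep only steps whose action is not a workup action, preserving order."""
    return [s for s in steps if s.action not in WORKUP_ACTIONS]


def unresolved_names(steps: List[Step], name_to_smiles: Dict[str, str]) -> List[str]:
    """Return chemical names that have no SMILES entry, in first-seen order."""
    seen, missing = set(), []
    for step in steps:
        for name in step.chemicals:
            if name not in name_to_smiles and name not in seen:
                seen.add(name)
                missing.append(name)
    return missing


# Assumed example of an LLM-produced step summary for one experimental paragraph.
EXAMPLE_JSON = """[
  {"action": "add", "chemicals": ["compound 12", "methanol"]},
  {"action": "add", "chemicals": ["sodium borohydride"]},
  {"action": "stir"},
  {"action": "wash", "chemicals": ["water"]},
  {"action": "concentrate"}
]"""

# Toy name-to-SMILES table standing in for a real chemical database.
TOY_LOOKUP = {
    "methanol": "CO",
    "sodium borohydride": "[Na+].[BH4-]",
}

steps = parse_steps(EXAMPLE_JSON)
kept = drop_workup(steps)
missing = unresolved_names(kept, TOY_LOOKUP)

print([s.action for s in kept])  # actions remaining after workup filtering
print(missing)                   # names that could not be resolved to SMILES
```

In this toy input, "compound 12" plays the role of an untraceable reference: it points to a compound defined elsewhere in the patent, so a name lookup alone cannot resolve it. Keeping the steps as an ordered list, rather than collapsing them into one reaction string, is what lets a multi-step procedure be inspected step by step.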

Noise Analysis and Data Refinement for Chemical Reactions from US Patents via Large Language Models

Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature

Stress Testing BERT Anaphora Resolution Models for Reaction Extraction in Chemical Patents

Chemical Reaction Extraction from Long Patent Documents

Extracting Structured Data from Organic Synthesis Procedures Using a Fine-Tuned Large Language Model

Automated patent extraction powers generative modeling in focused chemical spaces

ReLM: Leveraging Language Models for Enhanced Chemical Reaction Prediction

From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents

Mining Patents with Large Language Models Elucidates the Chemical Function Landscape

NETWORK ANALYSIS OF THE ORGANIC CHEMISTRY IN PATENTS, LITERATURE, AND PHARMACEUTICAL INDUSTRY

Fine-tuning Large Language Models for Chemical Text Mining

Learning Chemical Reaction Representation with Reactant-Product Alignment

From Words to Molecules: A Survey of Large Language Models in Chemistry

ORDerly: Data Sets and Benchmarks for Chemical Reaction Data

ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents

Exploring Chemical Reaction Space with Machine Learning Models: Representation and Feature Perspective

Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions

Prediction of Organic Reaction Outcomes Using Machine Learning

Advances in machine learning with chemical language models in molecular property and reaction outcome predictions