Noise Analysis and Data Refinement for Chemical Reactions from US Patents via Large Language Models

Chaewon Lee, Shuan Chen, Kai Tzu-iunn Ong, Jinyoung Yeo, Yousung Jung
DOI: https://doi.org/10.26434/chemrxiv-2024-1zl02
2024-10-30
Abstract: The extraction of chemical reactions from U.S. Patent and Trademark Office (USPTO) documents has enabled significant advancements in machine learning models for organic synthesis. While the USPTO dataset offers a large and diverse collection of reaction data, recent studies have identified issues such as inconsistent or missing chemical entries, impacting data quality. To address these challenges, we employed fine-tuned large language models (LLMs) to revisit experimental sections in the US patents, performing a comprehensive analysis of noisy reaction data. Our findings demonstrate that LLMs produce fewer false reactions compared to existing datasets and reveal that many reactions in US patents involve multiple experimental steps, previously overlooked by standard extraction methods. Our analysis suggests that untraceable references and erroneous chemical names are primary sources of data noise. We also identify reaction types with high susceptibility to these issues, recommending scientists avoid using those high-risk reaction data.
Chemistry
What problem does this paper attempt to address?
This paper addresses noise and data-quality problems in chemical reaction data extracted from United States Patent and Trademark Office (USPTO) documents. Specifically, the researchers focus on the following problems:

1. **Inconsistent or missing data**: Chemical entries in the USPTO dataset are sometimes inconsistent or missing, which lowers data quality.
2. **Overlooked multi-step reactions**: Many reactions in patents involve multiple experimental steps, which existing extraction methods usually overlook.
3. **Untraceable references and incorrect chemical names**: Untraceable references and erroneous chemical names are the main sources of data noise.
4. **High-risk reaction types**: Some reaction types are more susceptible to the problems above, and scientists are advised to avoid using data from these high-risk types.

To address these problems, the researchers used fine-tuned large language models (LLMs) to re-examine the experimental sections of US patents, carried out a comprehensive analysis of the noisy data, and proposed improved methods. With this approach, they hope to generate higher-quality reaction data and reduce the number of false-positive reactions.

### Specific problem description

- **Data noise**: The existing dataset contains a large amount of noisy data, such as repeated reactions, missing reagent information, and rare templates.
- **Multi-step reactions**: Many reactions involve multiple experimental steps, which existing extraction methods do not fully account for.
- **Chemical name recognition errors**: Some chemical names cannot be resolved to a corresponding SMILES string in the database, which adds to the data noise.

### Solutions

The researchers proposed a new method based on large language models (LLMs) to improve the accuracy and completeness of reaction data. Specific measures include:

1. **Re-parsing experimental paragraphs**: Use LLMs to re-parse the experimental paragraphs in US patents so that all relevant experimental steps are captured.
2. **Multi-step reaction representation**: Represent reaction data as a reaction sequence rather than a single string, to better reflect multi-step reactions.
3. **Fine-grained annotation and filtering**: Summarize the experimental steps as structured JSON and filter out steps unrelated to the reaction itself.
4. **Atom-mapping tool**: Use a recent atom-mapping tool (such as LocalMapper) to ensure that reactants and reagents are defined accurately.

Through these improvements, the researchers hope to significantly improve the quality of the chemical reaction data extracted from US patents, thereby providing more reliable data for machine-learning models in organic synthesis.
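The JSON summary, step filtering, and reaction-sequence representation described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names (`steps`, `kind`, `inputs`, `output`), the toy `NAME_TO_SMILES` lookup, and the example paragraph are hypothetical and are not the schema, name resolver, or data used in the paper.

```python
import json

# Hypothetical JSON summary of one experimental paragraph. The field names
# are illustrative assumptions, not the schema used in the paper.
PARAGRAPH_JSON = """
{
  "steps": [
    {"kind": "reaction",
     "inputs": ["benzoic acid", "thionyl chloride"],
     "output": "benzoyl chloride"},
    {"kind": "workup", "inputs": ["ethyl acetate"], "output": null},
    {"kind": "reaction",
     "inputs": ["benzoyl chloride", "the amine of Example 3"],
     "output": "N-benzylbenzamide"}
  ]
}
"""

# Toy name-to-SMILES table standing in for a real name resolver (assumption).
NAME_TO_SMILES = {
    "benzoic acid": "OC(=O)c1ccccc1",
    "thionyl chloride": "O=S(Cl)Cl",
    "benzoyl chloride": "O=C(Cl)c1ccccc1",
    "N-benzylbenzamide": "O=C(NCc1ccccc1)c1ccccc1",
}


def reaction_steps(summary):
    """Keep only steps tagged as reactions, dropping workup and similar steps."""
    return [step for step in summary["steps"] if step["kind"] == "reaction"]


def to_sequence(steps, lookup):
    """Build an ordered reaction sequence, one entry per reaction step.

    A step whose names all resolve becomes a 'reactants>>product' string.
    A step with any unresolved name is kept but flagged, so untraceable
    references stay visible instead of silently producing a false reaction.
    """
    sequence = []
    for step in steps:
        names = step["inputs"] + [step["output"]]
        unresolved = [name for name in names if name not in lookup]
        if unresolved:
            sequence.append({"smiles": None, "unresolved": unresolved})
        else:
            reactants = ".".join(lookup[name] for name in step["inputs"])
            product = lookup[step["output"]]
            sequence.append({"smiles": f"{reactants}>>{product}", "unresolved": []})
    return sequence


summary = json.loads(PARAGRAPH_JSON)
sequence = to_sequence(reaction_steps(summary), NAME_TO_SMILES)
for index, entry in enumerate(sequence, start=1):
    print(index, entry)
```

In this sketch the second reaction step is flagged rather than dropped, because one of its inputs refers to another example in the patent and has no entry in the lookup table. Keeping the sequence as a list of steps, instead of merging everything into one reaction string, preserves the order of the experimental steps.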