AutoTemplate: Enhancing Chemical Reaction Datasets for Machine Learning Applications in Organic Chemistry

Lung-Yi Chen, Yi-Pei Li

DOI: https://doi.org/10.26434/chemrxiv-2024-tq22r

2024-03-15

Abstract: This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction verification, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions.
AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis.

Scientific contribution: The proposed automated preprocessing tool for chemical reaction data aims to identify errors within chemical databases. Specifically, if the errors involve atom mapping or the absence of reactant types, corrections can be systematically applied using reaction templates, ultimately elevating the overall quality of the database.
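One simple consistency check in the spirit of the errors named above (missing reactants, atom-mapping problems) can be sketched in a few lines. This is a minimal sketch, not the AutoTemplate implementation: it assumes reactions are stored as atom-mapped reaction SMILES of the form `reactants>agents>products`, and it reads map numbers with a plain regular expression instead of a cheminformatics toolkit. The function names and the esterification example are hypothetical illustrations.

```python
import re
from collections import Counter

# Matches the atom-map number inside a bracket atom, e.g. the 3 in "[O:3]".
MAP_NUMBER = re.compile(r":(\d+)\]")


def map_numbers(smiles):
    """Return every atom-map number found in a SMILES string, in order."""
    return [int(n) for n in MAP_NUMBER.findall(smiles)]


def unexplained_product_atoms(rxn_smiles):
    """Map numbers that appear in the products but in no reactant.

    A non-empty result suggests a missing reactant or a mapping error.
    """
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    reactants, _agents, products = parts
    return sorted(set(map_numbers(products)) - set(map_numbers(reactants)))


def duplicated_map_numbers(smiles):
    """Map numbers used more than once on one side of a reaction."""
    counts = Counter(map_numbers(smiles))
    return sorted(n for n, c in counts.items() if c > 1)


# Hypothetical example: acetic acid + methanol -> methyl acetate.
complete = (
    "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]"
    ">>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
)
# Same record with the methanol reactant dropped.
incomplete = "[CH3:1][C:2](=[O:3])[OH:4]>>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"

print(unexplained_product_atoms(complete))    # []
print(unexplained_product_atoms(incomplete))  # [5, 6]
```

A check like this only flags a suspicious record. Deciding which reactant is missing, or how the mapping should be repaired, is the part the abstract assigns to the reaction templates.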

Chemistry

What problem does this paper attempt to address?

The paper addresses the quality of chemical reaction datasets used for machine learning in organic chemistry. High-quality datasets are crucial for tasks such as yield prediction, retrosynthetic analysis, and reaction condition prediction. However, existing chemical reaction datasets often suffer from inconsistencies and errors, such as missing reactants, incorrect atom mapping, and erroneous reaction records. The paper proposes a method called AutoTemplate, which consists of two stages: generic template extraction and template-guided reaction verification. In the first stage, meaningful reaction transformation rules are extracted and generic reaction templates are formulated in a simplified SMARTS representation, so that each template applies across a wide range of chemical reactions. In the second stage, these templates are systematically applied to validate and correct the reaction data: fixing atom-mapping errors, supplying missing reactant information, and identifying and removing erroneous reaction records. AutoTemplate assumes that most of the reaction data is correct and uses the correct reactions as templates to guide the correction of erroneous entries. Its contribution is to detect and correct erroneous chemical reactions in a single preprocessing step, which provides a more solid foundation for developing reliable machine learning models and improving the accuracy of forward and retrosynthetic prediction.
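The premise that most recorded reactions are correct can be illustrated with a small frequency filter. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes each reaction has already been reduced to a template string by some earlier extraction step, treats templates as opaque keys, and uses a hypothetical `min_support` threshold to decide which templates are trusted. The record IDs and template strings are invented for illustration.

```python
from collections import Counter


def split_by_template_support(records, min_support=2):
    """Split reactions by how often their template occurs in the dataset.

    records: list of (reaction_id, template) pairs.
    Returns (trusted_templates, keep_ids, review_ids). Reactions whose
    template is seen fewer than min_support times are flagged for review
    rather than deleted, since a rare template is not proof of an error.
    """
    support = Counter(template for _rid, template in records)
    trusted = {t for t, n in support.items() if n >= min_support}
    keep_ids = [rid for rid, t in records if t in trusted]
    review_ids = [rid for rid, t in records if t not in trusted]
    return trusted, keep_ids, review_ids


# Hypothetical records: three share an amide-formation template,
# one carries a template seen nowhere else in the dataset.
AMIDE = "[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]"
ODD = "[C:1][Cl:2]>>[C:1][Br:2]"
records = [("r1", AMIDE), ("r2", AMIDE), ("r3", AMIDE), ("r4", ODD)]

trusted, keep_ids, review_ids = split_by_template_support(records)
print(keep_ids)    # ['r1', 'r2', 'r3']
print(review_ids)  # ['r4']
```

In the approach summarized above, the well-supported templates would then be applied back to the flagged reactions to attempt a correction. This sketch stops at the flagging step.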
