AutoTemplate: Enhancing Chemical Reaction Datasets for Machine Learning Applications in Organic Chemistry

Lung-Yi Chen, Yi-Pei Li
DOI: https://doi.org/10.26434/chemrxiv-2024-tq22r
2024-03-15
Abstract: This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction verification, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions.
AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis.
Scientific contribution: The proposed automated preprocessing tool for chemical reaction data aims to identify errors within chemical databases. Specifically, if the errors involve atom mapping or the absence of reactant types, corrections can be systematically applied using reaction templates, ultimately elevating the overall quality of the database.
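To make the idea of a "simplified SMARTS representation" concrete, here is a minimal Python sketch. It assumes that simplification means keeping only the element symbol and atom-map number of each bracket atom and dropping specifiers such as hydrogen count, degree, and charge. That rule, the function name `simplify_smarts`, and the example template are illustrative assumptions; the abstract does not state the authors' exact procedure.

```python
import re

# Toy pattern for a simple bracket atom such as [C;H0;D3;+0:1].
# Group 1: element symbol; group 2: optional atom-map suffix like ":1".
# Assumption: everything between the symbol and the map number is a
# specifier that a generic template can drop. Atoms written with atomic
# numbers (e.g. [#6:1]) are left untouched by this sketch.
_BRACKET_ATOM = re.compile(r"\[([A-Za-z][a-z]?)[^\]:]*(:\d+)?\]")


def simplify_smarts(template: str) -> str:
    """Strip atom-level specifiers, keeping element symbols and map numbers."""
    return _BRACKET_ATOM.sub(
        lambda m: f"[{m.group(1)}{m.group(2) or ''}]", template
    )


# Hypothetical, overly specific amide-formation template.
specific = (
    "[C;H0;D3;+0:1](=[O;D1;H0:2])-[OH;D1;+0].[NH2;D1;+0:3]"
    ">>[C;H0;D3;+0:1](=[O;D1;H0:2])-[NH;D2;+0:3]"
)
print(simplify_smarts(specific))
```

For the hypothetical template above, this prints `[C:1](=[O:2])-[O].[N:3]>>[C:1](=[O:2])-[N:3]`. The simplified pattern no longer depends on the exact substitution of each atom, so it matches a wider range of reactions.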
Chemistry
What problem does this paper attempt to address?
The paper addresses the quality of chemical reaction datasets used for machine learning in organic chemistry. High-quality datasets are crucial for tasks such as yield prediction, retrosynthetic analysis, and reaction condition prediction. However, existing chemical reaction datasets often suffer from inconsistencies and errors, such as missing reactants, incorrect atom mappings, and erroneous reaction records. The paper proposes a method called AutoTemplate, which consists of two stages: generic template extraction and template-guided reaction verification. In the first stage, meaningful reaction transformation rules are extracted and generic reaction templates are formulated in a simplified SMARTS notation, so that each template applies to a wide range of chemical reactions. In the second stage, these templates are systematically applied to validate and correct the reaction data: they rectify atom-mapping errors, supply missing reactant information, and identify and remove erroneous reaction records. AutoTemplate assumes that most of the reaction data is correct and uses the correct reactions as templates to guide the correction of erroneous entries. Its contribution is a preprocessing protocol that detects and corrects erroneous chemical reactions at the same time, providing a more solid foundation for developing reliable machine learning models and improving the accuracy of forward and retrosynthetic predictions.
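The premise that most reactions in a dataset are correct can be sketched as a frequency filter in Python: templates extracted from many reactions are trusted, and reactions whose template is rare are flagged for template-guided checking. The function name, the `(reaction_id, template)` input format, and the `min_count` threshold are hypothetical. The real verification step applies templates to the reactants with a cheminformatics toolkit, which this sketch does not attempt.

```python
from collections import Counter


def split_by_template_support(reaction_templates, min_count=2):
    """Separate reactions by how often their extracted template occurs.

    reaction_templates: list of (reaction_id, template_string) pairs.
    min_count: hypothetical support threshold for trusting a template.
    Returns (trusted_templates, supported_ids, flagged_ids).
    """
    counts = Counter(template for _, template in reaction_templates)
    # Templates seen at least min_count times are assumed to come from
    # correct reactions (the "majority is accurate" premise).
    trusted = {t for t, n in counts.items() if n >= min_count}
    supported = [rid for rid, t in reaction_templates if t in trusted]
    # Reactions with rare templates are candidates for correction or removal.
    flagged = [rid for rid, t in reaction_templates if t not in trusted]
    return trusted, supported, flagged


# Toy data: three reactions share one template, one reaction stands alone.
data = [("r1", "A>>B"), ("r2", "A>>B"), ("r3", "A>>B"), ("r4", "X>>Y")]
trusted, supported, flagged = split_by_template_support(data)
print(trusted, supported, flagged)
```

On the toy data, `r4` is the only flagged entry. A full pipeline would then try the trusted templates on its reactants, either to repair the record or to reject it.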