Dissecting Errors in Machine Learning for Retrosynthesis: A Granular Metric Framework and Transformer-Based Model for More Informative Predictions

Arihanth Srikar Tadanki, H. Surya Prakash Rao, U. Deva Priyakumar
DOI: https://doi.org/10.26434/chemrxiv-2024-nbdtj
2024-07-01
Abstract: Chemical reaction prediction, encompassing forward synthesis and retrosynthesis, stands as a fundamental challenge in organic synthesis. A widely adopted computational approach frames synthesis prediction as a sequence-to-sequence translation task, using the common SMILES representation for molecules. Current evaluations of machine learning methods for retrosynthesis assume perfect training data, overlooking imperfections in the reaction equations of popular datasets, such as missing reactants or products and omitted physical and practical constraints such as temperature and cost, primarily driven by a focus on the target molecule. This limitation leads to an incomplete representation of viable synthetic routes, especially when multiple sets of reactants can yield a given desired product. In response to these shortcomings, this study examines the prevailing evaluation methods and introduces comprehensive metrics designed to address imperfections in the dataset. Our novel metrics not only assess absolute accuracy by comparing predicted outputs with ground truth but also introduce a nuanced evaluation approach. We provide scores for partial correctness and compute adjusted accuracy through graph matching, acknowledging the inherent complexities of retrosynthetic pathways. Additionally, we explore the impact of small molecular augmentations while preserving chemical properties and employ similarity matching to enhance the assessment of prediction quality. We introduce SynFormer, a sequence-to-sequence model tailored for SMILES representation. It incorporates architectural enhancements to the original transformer, effectively tackling the challenges of chemical reaction prediction. SynFormer achieves a top-1 accuracy of 53.2% on the USPTO-50k dataset, demonstrating an improvement over previous state-of-the-art language models while being more efficient and eliminating the need for pre-training.
Chemistry
What problem does this paper attempt to address?
This paper focuses on error analysis in retrosynthesis prediction, a fundamental challenge in organic synthesis. Existing machine learning methods often treat synthesis prediction as a sequence-to-sequence translation task over SMILES strings. Their evaluation, however, assumes that the training data is perfect and disregards imperfections in the reaction equations of popular datasets, such as missing reactants or products and omitted practical conditions like temperature and cost. The result is an incomplete picture of feasible synthesis pathways, especially when several different sets of reactants can yield the same target product: a prediction that is chemically valid but differs from the single recorded ground truth is counted as wrong. To address these issues, the paper proposes a granular metric framework and a transformer-based model called SynFormer. The new metrics evaluate absolute accuracy against the ground truth, score partial correctness, and compute an adjusted accuracy through graph matching, acknowledging the intrinsic complexity of retrosynthetic pathways. The paper also examines the impact of small molecular augmentations that preserve chemical properties, and uses similarity matching to better assess prediction quality. In addition, it introduces the Retrosynthesis Refinement Index (RRI), a metric intended to identify "better errors" and rank methods by the degree of correctness of their predictions. SynFormer is a sequence-to-sequence model tailored to the SMILES representation that modifies the original Transformer architecture. It achieves a top-1 accuracy of 53.2% on the USPTO-50k dataset, improving on previous state-of-the-art language models, and it requires no pretraining, which reduces training time. The paper reports that SynFormer also compares favorably with previous models under the more comprehensive evaluation methodology.
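To make the difference between absolute accuracy and partial correctness concrete, the following is a minimal sketch, not the paper's implementation. It assumes that reactant sets are written as dot-separated SMILES strings that have already been canonicalized, so plain string comparison stands in for the graph matching the paper describes. The function names (`split_reactants`, `exact_match`, `partial_correctness`, `top_k_accuracy`) are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def split_reactants(smiles: str) -> Set[str]:
    """Split a dot-separated SMILES string into a set of molecule strings.

    Assumption: each molecule is already in canonical form, so two equal
    molecules are also equal as strings.
    """
    return {part for part in smiles.strip().split(".") if part}


def exact_match(predicted: str, reference: str) -> bool:
    """Absolute correctness: the predicted reactant set equals the reference set."""
    return split_reactants(predicted) == split_reactants(reference)


def partial_correctness(predicted: str, reference: str) -> float:
    """Fraction of reference reactants that also appear in the prediction."""
    ref = split_reactants(reference)
    if not ref:
        return 0.0
    return len(split_reactants(predicted) & ref) / len(ref)


def top_k_accuracy(predictions: List[List[str]], references: List[str], k: int = 1) -> float:
    """Share of targets whose reference set appears among the first k predictions."""
    if not references:
        return 0.0
    hits = sum(
        any(exact_match(candidate, ref) for candidate in ranked[:k])
        for ranked, ref in zip(predictions, references)
    )
    return hits / len(references)
```

Under this sketch, a prediction that recovers one of two reference reactants scores zero on exact match but one half on partial correctness, which is the kind of "better error" a finer-grained metric can surface. Because the comparison is purely string-based, equivalent molecules written with different SMILES strings would be treated as mismatches; a cheminformatics toolkit would be needed for canonicalization or true graph matching.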