Abstract: Chemical reaction prediction, encompassing forward synthesis and retrosynthesis, stands as a fundamental challenge in organic synthesis. A widely adopted computational approach frames synthesis prediction as a sequence-to-sequence translation task, using the common SMILES representation for molecules. Current evaluations of machine learning methods for retrosynthesis assume perfect training data, overlooking imperfections in the reaction equations of popular datasets, such as missing reactants or products and omitted physical and practical constraints like temperature and cost. These gaps are primarily driven by a focus on the target molecule. This limitation leads to an incomplete representation of viable synthetic routes, especially when multiple sets of reactants can yield a given desired product. In response to these shortcomings, this study examines the prevailing evaluation methods and introduces comprehensive metrics designed to address imperfections in the dataset. Our novel metrics assess absolute accuracy by comparing predicted outputs with ground truth, and they also provide scores for partial correctness and compute adjusted accuracy through graph matching, acknowledging the inherent complexities of retrosynthetic pathways. Additionally, we explore the impact of small molecular augmentations that preserve chemical properties and employ similarity matching to enhance the assessment of prediction quality. We introduce SynFormer, a sequence-to-sequence model tailored for the SMILES representation. It incorporates architectural enhancements to the original transformer, effectively tackling the challenges of chemical reaction prediction. SynFormer achieves a top-1 accuracy of 53.2% on the USPTO-50k dataset, demonstrating an improvement over previous state-of-the-art language models while being more efficient and eliminating the need for pre-training.

What problem does this paper attempt to address?

This paper focuses on error analysis in retrosynthesis prediction, a fundamental challenge in organic synthesis. Existing machine learning methods often treat synthesis prediction as a sequence-to-sequence translation task in which molecules are written as SMILES strings. However, their evaluation assumes that the training data is perfect and disregards imperfections in the reaction equations of popular datasets, such as missing reactants or products and unrecorded practical conditions like temperature and cost. This leads to an incomplete representation of feasible synthesis pathways, especially when multiple combinations of reactants can generate the same target product. To address these issues, the paper proposes a granular metric framework and a transformer-based model called SynFormer to provide more informative predictions. The new metrics evaluate absolute accuracy and additionally score partial correctness and compute an accuracy adjusted through graph matching, acknowledging the intrinsic complexity of retrosynthesis pathways. The paper also explores the impact of small molecular augmentations that preserve chemical properties, and it uses similarity matching to better assess prediction quality. SynFormer is a sequence-to-sequence model optimized for the SMILES representation, with architectural changes to the original Transformer aimed at the challenges of chemical reaction prediction. It achieves a top-1 accuracy of 53.2% on the USPTO-50k dataset, surpassing previous state-of-the-art language models while being more efficient and requiring no pretraining, which reduces training time. The paper also introduces a new evaluation metric called the Retrosynthesis Refinement Index (RRI) to identify "better errors" and rank methods by the degree of correctness of their predictions. Under this more comprehensive evaluation methodology, SynFormer is reported to compare favorably with previous models.
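The difference between absolute accuracy and partial correctness can be illustrated with a minimal sketch. This is not the paper's metric framework or its RRI: the function names below are hypothetical, the comparison is plain string-set matching rather than graph matching, and the SMILES strings are assumed to be already canonicalized so that identical molecules have identical strings.

```python
from typing import List, Set


def split_reactants(smiles: str) -> Set[str]:
    """Split a dot-separated SMILES reactant string into a set of molecules.

    Assumption: each molecule string is already canonical, so plain string
    equality stands in for molecular identity.
    """
    return {part for part in smiles.strip().split(".") if part}


def exact_match(predicted: str, truth: str) -> bool:
    """Absolute accuracy criterion: same reactant set, in any order."""
    return split_reactants(predicted) == split_reactants(truth)


def partial_score(predicted: str, truth: str) -> float:
    """Fraction of ground-truth reactants recovered by the prediction."""
    pred, true = split_reactants(predicted), split_reactants(truth)
    if not true:
        return 0.0
    return len(pred & true) / len(true)


def top_k_accuracy(candidates: List[List[str]], truths: List[str], k: int = 1) -> float:
    """Share of targets whose first k ranked candidates contain an exact match."""
    hits = sum(
        any(exact_match(c, t) for c in ranked[:k])
        for ranked, t in zip(candidates, truths)
    )
    return hits / len(truths)


# Ground truth for ethyl acetate: acetic acid + ethanol.
truth = "CC(=O)O.OCC"
# Acetyl chloride + ethanol is also a chemically valid route to ethyl acetate,
# yet exact matching counts it as a plain error.
prediction = "CC(=O)Cl.OCC"

print(exact_match(prediction, truth))    # False
print(partial_score(prediction, truth))  # 0.5
```

The prediction here is scored as wrong by exact matching even though it shares one reactant with the ground truth and describes a workable route, which is the kind of "better error" that a graded metric is meant to surface.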

Dissecting Errors in Machine Learning for Retrosynthesis: A Granular Metric Framework and Transformer-Based Model for More Informative Predictions
