Improve retrosynthesis planning with a molecular editing language

Mingyue Zheng, Jiacheng Xiong, Wei Zhang, Zunyun Fu, Jiatao Huang, Yitian Wang, Zhaoping Xiong, Xiangtai Kong

DOI: https://doi.org/10.26434/chemrxiv-2023-bxhk8

2023-12-26

Abstract: Retrosynthetic analysis is a fundamental strategy in organic synthesis, and many computational methods have been developed to address this task. A widely adopted approach is to treat retrosynthetic prediction as a sequence-to-sequence (seq2seq) translation task, where the Simplified Molecular Input Line Entry System (SMILES) string of a product is translated into the SMILES strings of its corresponding reactants. However, sequence-based models that use SMILES face several issues, including limited performance and a lack of interpretability and controllability. In this work, we introduce a novel chemical language for retrosynthetic prediction named E-SMILES, an extension of SMILES designed specifically for seq2seq retrosynthetic prediction. This language not only documents the static molecular structure but also encodes the editing operations applied to the molecule in the retrosynthetic process, enabling it to characterize retrosynthesis reactions more effectively. By using E-SMILES, seq2seq retrosynthetic models can simulate the stepwise retrosynthetic analysis strategy of chemists, ensuring that atoms match between the predicted reactants and the product, and yielding more interpretable and controllable predictions. Furthermore, E-SMILES is naturally aligned with the product's SMILES, reducing the edit distance between the model's input and output sequences. This frees the model from learning the complex SMILES syntax and allows it to focus on the retrosynthesis process itself. Leveraging E-SMILES, our retrosynthesis model achieves top-1 accuracies of 58.9% and 68.5% on the USPTO-50k dataset, without and with given reaction class, respectively, significantly surpassing previous state-of-the-art results. We envisage that E-SMILES can serve as a new foundational tool, promoting the development of sequence-based retrosynthetic prediction methods.
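The top-1 figures quoted in the abstract are exact-match accuracies over ranked predictions. The sketch below shows one way such a top-k metric can be computed. It is only an illustration: the function names `normalize` and `top_k_accuracy` and the toy data are hypothetical, and the normalization used here (sorting dot-separated fragments) is a simplifying assumption standing in for proper SMILES canonicalization with a cheminformatics toolkit.

```python
from typing import List


def normalize(smiles: str) -> str:
    """Make the comparison insensitive to the order of dot-separated reactants.

    Assumption: each fragment is already written in a canonical form.
    A real evaluation would canonicalize every fragment first.
    """
    return ".".join(sorted(smiles.split(".")))


def top_k_accuracy(predictions: List[List[str]], targets: List[str], k: int = 1) -> float:
    """Fraction of targets found among the first k ranked predictions."""
    if not targets:
        return 0.0
    hits = 0
    for ranked, target in zip(predictions, targets):
        candidates = {normalize(p) for p in ranked[:k]}
        if normalize(target) in candidates:
            hits += 1
    return hits / len(targets)


# Toy data (not from the paper): two products, two ranked guesses each.
predictions = [
    ["CC(=O)O.NC", "CC(=O)Cl.NC"],
    ["CCO", "CCBr.O"],
]
targets = ["NC.CC(=O)O", "O.CCBr"]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
print(top1, top2)
```

In this toy case the first target is matched by the top-ranked guess and the second only by the runner-up, so the top-2 score is higher than the top-1 score.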

Chemistry

What problem does this paper attempt to address?

This paper addresses retrosynthetic analysis in organic synthesis, a core strategy for designing synthetic routes to compounds. Existing methods often treat retrosynthetic prediction as a sequence-to-sequence (seq2seq) translation task that converts the SMILES of the product molecule into the SMILES of the reactants. This approach is limited in performance, interpretability, and controllability, and translating directly from product SMILES to reactant SMILES lacks the stepwise reasoning process of human experts.

The researchers propose a new chemical language called E-SMILES (Edited SMILES), an extension of SMILES designed specifically for retrosynthetic prediction. E-SMILES records not only the static structure of the molecule but also the molecular editing operations, such as bond cleavage and related changes, that occur during the retrosynthetic process. With E-SMILES, a seq2seq model can simulate the stepwise retrosynthetic analysis strategy of chemists, which ensures atom matching between product and predicted reactants and makes predictions more interpretable and controllable.

Using E-SMILES, the model achieves top-1 accuracies of 58.9% (without given reaction class) and 68.5% (with given reaction class) on the USPTO-50k dataset, significantly outperforming previous methods. Through comparative experiments, the paper demonstrates the advantages of E-SMILES over other languages and shows consistent performance across different types of chemical reactions. The paper suggests that E-SMILES can serve as a new tool to promote the development of sequence-based retrosynthetic prediction methods.
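The point about alignment between input and output sequences can be made concrete with a small edit-distance calculation. The sketch below does not use E-SMILES syntax, which is not specified in this text. It only compares plain SMILES strings, with a hypothetical amide-formation example chosen for illustration, to show that reactants written in the same atom order as the product sit much closer to the product string than an equally valid but differently ordered spelling.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Illustrative example (not taken from the paper):
product = "CC(=O)NC"                # N-methylacetamide
reactants_aligned = "CC(=O)O.NC"    # acetic acid + methylamine, in product atom order
reactants_shuffled = "CN.OC(C)=O"   # same two molecules, different valid spelling

d_aligned = levenshtein(product, reactants_aligned)
d_shuffled = levenshtein(product, reactants_shuffled)
print(d_aligned, d_shuffled)
```

The aligned spelling differs from the product only by the inserted `O.`, so a model trained on aligned targets mostly copies its input and has to learn just the local change.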
