Improve retrosynthesis planning with a molecular editing language

Mingyue Zheng, Jiacheng Xiong, Wei Zhang, Zunyun Fu, Jiatao Huang, Yitian Wang, Zhaoping Xiong, Xiangtai Kong
DOI: https://doi.org/10.26434/chemrxiv-2023-bxhk8
2023-12-26
Abstract: Retrosynthetic analysis is a fundamental strategy in organic synthesis, and many computational methods have been developed to address this task. A widely adopted approach is to treat retrosynthetic prediction as a sequence-to-sequence (seq2seq) translation task, where the Simplified Molecular Input Line Entry System (SMILES) string of a product is translated into the SMILES of its corresponding reactants. However, these SMILES-based sequence models face several issues, including limited performance and a lack of interpretability and controllability. In this work, we introduce a novel chemical language for retrosynthetic prediction named E-SMILES, an extension of SMILES specially designed for seq2seq retrosynthetic prediction. This language not only documents the static molecular structure but also encodes the editing operations applied to the molecule in the retrosynthetic process, enabling it to characterize retrosynthesis reactions more effectively. By using E-SMILES, seq2seq retrosynthetic models can simulate the stepwise retrosynthetic analysis strategy of chemists, ensuring that atoms are matched between the predicted reactants and the product, and yielding more interpretable and controllable predictions. Furthermore, E-SMILES is naturally aligned with the product's SMILES, reducing the edit distance between the model's input and output sequences. This frees the model from learning the complex SMILES syntax and allows it to focus on the retrosynthesis process itself. Leveraging E-SMILES, our retrosynthesis model achieves top-1 accuracies of 58.9% and 68.5% on the USPTO-50k dataset, without and with given reaction class, respectively, significantly surpassing previous state-of-the-art results. We envisage that E-SMILES can serve as a new foundational tool, promoting the development of sequence-based retrosynthetic prediction methods.
Chemistry
What problem does this paper attempt to address?
This paper addresses retrosynthetic analysis, an important strategy in organic chemistry for designing synthetic routes to target compounds. Existing methods often treat retrosynthetic prediction as a sequence-to-sequence (seq2seq) translation task that converts the SMILES of the product molecule into the SMILES of its reactants. This approach is limited in performance, interpretability, and controllability, and translating directly from product SMILES to reactant SMILES lacks the stepwise reasoning process of human experts.

To address this, the researchers propose a new chemical language called E-SMILES (Edited SMILES), an extension of SMILES designed specifically for retrosynthetic prediction. E-SMILES records not only the static structure of the molecule but also the molecular editing operations, such as bond cleavage and other changes to the molecule, applied during the retrosynthetic process. With E-SMILES, a seq2seq model can simulate the stepwise retrosynthetic analysis strategy of chemists, which ensures atom matching between the predicted reactants and the product and makes predictions more interpretable and controllable.

Using E-SMILES, the model achieves top-1 accuracies of 58.9% (without given reaction class) and 68.5% (with given reaction class) on the USPTO-50k dataset, significantly outperforming previous methods. Through comparative experiments, the paper demonstrates the advantages of E-SMILES over other languages and shows consistent performance across different types of chemical reactions. The paper suggests that E-SMILES can serve as a new tool to promote the development of sequence-based retrosynthetic prediction methods.
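The abstract's point about edit distance between a model's input and output sequences can be illustrated with a small sketch. It does not use E-SMILES, whose syntax is not given here. It only compares a plain product SMILES against two plain-SMILES writings of the same reactants, one that follows the product's atom order and one that does not. The amide-formation example, the variable names, and the `levenshtein` helper are illustrative assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical toy reaction: acetic acid + methylamine -> N-methylacetamide.
product = "CC(=O)NC"
reactants_aligned = "CC(=O)O.NC"    # reactants written in the product's atom order
reactants_shuffled = "CN.OC(C)=O"   # same molecules, different valid SMILES

d_aligned = levenshtein(product, reactants_aligned)
d_shuffled = levenshtein(product, reactants_shuffled)
print(f"aligned:  {d_aligned}")
print(f"shuffled: {d_shuffled}")
```

In this toy case the aligned writing differs from the product only by the inserted `O.`, while the shuffled writing needs more edits even though it denotes the same molecules. That gap is the kind of avoidable sequence-level work an output aligned with the product is meant to remove.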