Prediction of chemical reaction yields with large-scale multi-view pre-training

Runhan Shi, Gufeng Yu, Xiaohong Huo, Yang Yang
DOI: https://doi.org/10.1186/s13321-024-00815-2
2024-02-28
Journal of Cheminformatics
Abstract: Developing machine learning models with high generalization capability for predicting chemical reaction yields is of significant interest and importance. The efficacy of such models depends heavily on the representation of chemical reactions, which has commonly been learned from SMILES or graphs of molecules using deep neural networks. However, the progression of chemical reactions is inherently determined by the molecular 3D geometric properties, which have been recently highlighted as crucial features in accurately predicting molecular properties and chemical reactions. Additionally, large-scale pre-training has been shown to be essential in enhancing the generalization capability of complex deep learning models. Based on these considerations, we propose the Reaction Multi-View Pre-training (ReaMVP) framework, which leverages self-supervised learning techniques and a two-stage pre-training strategy to predict chemical reaction yields. By incorporating multi-view learning with 3D geometric information, ReaMVP achieves state-of-the-art performance on two benchmark datasets. Notably, the experimental results indicate that ReaMVP has a significant advantage in predicting out-of-sample data, suggesting an enhanced generalization ability to predict new reactions.
Scientific Contribution: This study presents the ReaMVP framework, which improves the generalization capability of machine learning models for predicting chemical reaction yields. By integrating sequential and geometric views and leveraging self-supervised learning techniques with a two-stage pre-training strategy, ReaMVP achieves state-of-the-art performance on benchmark datasets. The framework demonstrates superior predictive ability for out-of-sample data and enhances the prediction of new reactions.
chemistry, multidisciplinary, computer science, interdisciplinary applications, information systems
What problem does this paper attempt to address?
The paper addresses the prediction of chemical reaction yields in organic chemistry, using machine learning to make such predictions both efficient and accurate. Accurate yield prediction helps synthetic chemists choose suitable synthesis routes and, in particular, identify highly active and selective catalysts. The paper proposes a framework called ReaMVP (Reaction Multi-View Pre-training), which combines the sequential and geometric views of a reaction and uses self-supervised learning with a two-stage pre-training strategy to improve the model's generalization ability. Its contributions are:

1. **Multi-View Representation Learning**: By considering both the sequential information (such as SMILES representations) and the geometric information (such as molecular 3D structures) of chemical reactions, ReaMVP captures richer and more complete structural information. In particular, it proposes an effective method for encoding chemical reactions in the geometric view.
2. **Self-Supervised Pre-Training Method**: ReaMVP introduces a novel self-supervised pre-training method, based on distribution alignment and contrastive learning, that captures the consistency of a chemical reaction across different views.
3. **Large-Scale Pre-Training Enhances Generalization Ability**: By pre-training on large-scale datasets, ReaMVP generalizes well when predicting chemical reaction yields and outperforms baseline models on benchmark datasets, especially for molecules not present in the training set (i.e., out-of-sample conditions).
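The contrastive part of the second contribution can be illustrated with a generic symmetric InfoNCE loss. This is a minimal sketch, not the authors' implementation: it assumes each reaction already has one embedding per view, it omits the distribution-alignment term entirely, and the function names, the temperature of 0.1, and the random toy embeddings are all assumptions made for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def one_way_loss(anchors, candidates, temperature):
    """Cross-entropy where candidate i is the positive for anchor i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, cand) / temperature for cand in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def info_nce(seq_emb, geo_emb, temperature=0.1):
    """Symmetric InfoNCE between sequence-view and geometry-view embeddings."""
    s = [normalize(v) for v in seq_emb]
    g = [normalize(v) for v in geo_emb]
    return 0.5 * (one_way_loss(s, g, temperature) + one_way_loss(g, s, temperature))


# Toy embeddings (random, for illustration only): 8 reactions, 16 dimensions.
rng = random.Random(0)
seq = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
geo_aligned = [[x + 0.01 * rng.gauss(0, 1) for x in v] for v in seq]
geo_random = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = info_nce(seq, geo_aligned)
loss_random = info_nce(seq, geo_random)
print(f"aligned views: {loss_aligned:.4f}, unrelated views: {loss_random:.4f}")
```

The loss is low when the two views of the same reaction are closer to each other than to the views of other reactions in the batch, which is one common way to formalize "consistency across views".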
Experimental results show that ReaMVP improves on baseline models on two benchmark datasets, Buchwald-Hartwig and Suzuki-Miyaura, particularly under out-of-sample conditions. This indicates strong generalization ability and good performance when predicting new reactions.
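An out-of-sample split of the kind described above can be sketched as follows. The idea is that every reaction containing a held-out component goes to the test set, so the model never sees that molecule during training. The field names, component identifiers, and yield values below are made-up placeholders, not data from either benchmark, and the paper's actual splits may be constructed differently.

```python
def out_of_sample_split(reactions, field, held_out):
    """Send every reaction whose `field` value is in `held_out` to the test set."""
    held_out = set(held_out)
    train = [r for r in reactions if r[field] not in held_out]
    test = [r for r in reactions if r[field] in held_out]
    return train, test


# Hypothetical toy records; identifiers and yields are invented for illustration.
reactions = [
    {"ligand": "L1", "additive": "A1", "yield_pct": 72.0},
    {"ligand": "L2", "additive": "A1", "yield_pct": 15.5},
    {"ligand": "L1", "additive": "A2", "yield_pct": 48.0},
    {"ligand": "L2", "additive": "A3", "yield_pct": 90.1},
    {"ligand": "L1", "additive": "A3", "yield_pct": 33.3},
]

train, test = out_of_sample_split(reactions, "additive", {"A3"})
print(len(train), "training reactions,", len(test), "test reactions")
```

In contrast, a random split would typically leave every additive represented in both sets, which measures interpolation rather than generalization to new molecules.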