Carbohydrate NMR chemical shift predictions using E(3) equivariant graph neural networks

Maria Bånkestad, Keven M. Dorst, Göran Widmalm, Jerk Rönnols
2023-11-21
Abstract: Carbohydrates, vital components of biological systems, are well-known for their structural diversity. Nuclear Magnetic Resonance (NMR) spectroscopy plays a crucial role in understanding their intricate molecular arrangements and is essential in assessing and verifying the molecular structure of organic molecules. An important part of this process is to predict the NMR chemical shift from the molecular structure. This work introduces a novel approach that leverages E(3) equivariant graph neural networks to predict carbohydrate NMR spectra. Notably, our model achieves a substantial reduction in mean absolute error, up to threefold, compared to traditional models that rely solely on two-dimensional molecular structure. Even with limited data, the model excels, highlighting its robustness and generalization capabilities. The implications are far-reaching and go beyond an advanced understanding of carbohydrate structures and spectral interpretation. For example, it could accelerate research in pharmaceutical applications, biochemistry, and structural biology, offering a faster and more reliable analysis of molecular structures. Furthermore, our approach is a key step towards a new data-driven era in spectroscopy, potentially influencing spectroscopic techniques beyond NMR.
Machine Learning, Chemical Physics
What problem does this paper attempt to address?
The problem this paper addresses is predicting the nuclear magnetic resonance (NMR) chemical shifts of carbohydrates with E(3)-equivariant graph neural networks. Specifically, the researchers aim to predict carbohydrate NMR spectra accurately and to overcome the limitations of existing methods on complex stereochemical structures. Traditional prediction methods are usually based on two-dimensional molecular structures and perform poorly on molecules with complex stereochemistry, such as carbohydrates. The method proposed in this paper takes the three-dimensional geometry of the molecule into account and preserves the symmetries of the underlying physical properties, which improves prediction accuracy and robustness.

### Main Contributions

1. **GeqShift Model**: A new E(3)-equivariant graph neural network model, GeqShift, is introduced for predicting the NMR chemical shifts of carbohydrates. The model performs well on complex stereochemical structures.
2. **Data Augmentation Method**: A data augmentation method is proposed that simulates the dynamic movement of molecules in fluids and improves the generalization ability of the model.
3. **Carbohydrate Dataset**: A dataset of carbohydrate chemical shifts suitable for machine-learning applications is compiled and publicly released to promote further research.

### Experimental Results

- **Performance Evaluation**: Model performance is evaluated by ten-fold cross-validation; each partition contains approximately 336 training structures and 39 test structures.
- **Prediction Error**: For 13C and 1H chemical shifts, the mean absolute error (MAE) of the GeqShift model is 0.31 ppm and 0.032 ppm respectively, more than three times lower than the errors of a model that relies only on two-dimensional molecular structure information.
- **Comparison with Other Methods**: Compared with existing state-of-the-art models (such as SG-IMP-IR), GeqShift is substantially more accurate. For example, for 13C chemical shifts, the MAE of SG-IMP-IR is 1.09 ppm, while the MAE of GeqShift is only 0.31 ppm.

### Discussion

- **Model Advantages**: By considering the three-dimensional structure and geometric properties of molecules, the GeqShift model captures the influence of changes in molecular geometry on chemical shifts more accurately.
- **Data Augmentation**: Using multiple conformations for data augmentation, to simulate the real dynamic behavior of molecules, significantly improves the generalization ability and prediction accuracy of the model.
- **Future Applications**: The model is applicable not only to carbohydrates; it also has the potential to predict the chemical shifts of other organic molecules, especially compounds with asymmetric centers, such as drug molecules and proteins. The method could also be applied to other analytical techniques, such as X-ray photoelectron spectroscopy and X-ray absorption spectroscopy.

### Formula Examples

- **Equivariant Convolution**:
  - Definition of an equivariant function: \[ f(D_X(g)(x)) = D_Y(g)f(x) \]
  - Here, \( D_X(g) \) and \( D_Y(g) \) are transformation matrices parameterized by \( g \), acting on the input space \( X \) and the output space \( Y \) respectively.
- **Message Passing Process**:
  - Message function: \[ m_{ij}^k = f_m(x_i^k, x_j^k, e_{ij}^k) \]
  - Aggregation function: \[ \hat{x}_i^{k + 1} = f_a\left(\sum_{j \in N(i)} m_{ij}^k\right) \]
  - Update function: \[ x_i^{k + 1} = f_u(x_i^k, \hat{x}_i^{k + 1}) \]

Together, these contributions provide an efficient and accurate method for predicting the NMR spectra of carbohydrates and are expected to accelerate research in related fields.
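
The equivariance condition and the message, aggregation, and update steps above can be checked numerically on a toy graph. The following is a minimal sketch and not the GeqShift architecture: the node features are simply 3D positions, the message function (relative position weighted by `exp(-distance)`), the identity aggregation, the additive update, and all function names are assumptions chosen for illustration. For this toy layer both \( D_X(g) \) and \( D_Y(g) \) are the same rotation matrix, and only a rotation about the z-axis is tested.

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)

def message(xi, xj):
    """f_m (hypothetical): relative position scaled by a distance-dependent weight."""
    d = [a - b for a, b in zip(xi, xj)]
    r = math.sqrt(sum(v * v for v in d))
    w = math.exp(-r)  # depends only on the distance, so it is rotation-invariant
    return [w * v for v in d]

def layer(positions, neighbors):
    """One message-passing step: sum messages over N(i), then update node i."""
    out = []
    for i, xi in enumerate(positions):
        agg = [0.0, 0.0, 0.0]
        for j in neighbors[i]:
            m = message(xi, positions[j])
            agg = [a + b for a, b in zip(agg, m)]
        # f_a is the identity on the summed messages; f_u adds them to x_i.
        out.append(tuple(a + b for a, b in zip(xi, agg)))
    return out

def equivariance_error(positions, neighbors, theta):
    """Largest coordinate gap between f(R x) and R f(x)."""
    lhs = layer([rotate_z(p, theta) for p in positions], neighbors)
    rhs = [rotate_z(p, theta) for p in layer(positions, neighbors)]
    return max(abs(a - b) for u, v in zip(lhs, rhs) for a, b in zip(u, v))

# Toy three-node, fully connected graph (coordinates are made up).
positions = [(0.0, 0.0, 0.0), (1.0, 0.5, -0.2), (-0.3, 1.2, 0.8)]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(equivariance_error(positions, neighbors, theta=0.7))
```

The printed error should be at the level of floating-point rounding, because the message is built only from relative positions and distances. A layer that fed raw coordinates through an arbitrary nonlinearity would generally fail this check, which is the motivation for constraining the layers to be equivariant.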