Explainable Graph Neural Networks with Data Augmentation for Predicting pKa of C–H Acids

Hongle An, Xuyang Liu, Wensheng Cai, Xueguang Shao
DOI: https://doi.org/10.1021/acs.jcim.3c00958
IF: 6.162
2023-09-14
Journal of Chemical Information and Modeling
Abstract: The pKa of C–H acids is an important parameter in organic synthesis, drug discovery, and materials science. However, predicting pKa remains a great challenge because experimental data are limited and chemical insight is lacking. Here, a new model for predicting the pKa values of C–H acids is proposed on the basis of graph neural networks (GNNs) and data augmentation. A message passing unit (MPU) was used to extract topological and target-related information from the molecular graph data, and a readout layer was used to retrieve the information on the C atom at the ionization site. The retrieved information was then passed to a fully connected network to predict pKa. Furthermore, to increase the diversity of the training data, a knowledge-infused data augmentation technique was established by replacing the H atoms in a molecule with substituents exhibiting different electronic effects. The MPU was pretrained with the augmented data. The efficacy of data augmentation was confirmed by visualizing the distribution of compounds with different substituents and by classifying compounds. The explainability of the model was studied by examining the change in the predicted pKa values when a specific atom was masked. This explainability was used to identify the key substituents for pKa. The model was evaluated on two data sets from the iBonD database. Dataset 1 includes the experimental pKa values of C–H acids measured in DMSO, while dataset 2 comprises the pKa values measured in water. The results show that the knowledge-infused data augmentation technique greatly improves the predictive accuracy of the model, especially when the number of samples is small.
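The masking-based explanation described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the graph, the scalar "electron-withdrawing" features, the averaging update, and the linear head in `predict_pka` are all hypothetical stand-ins chosen only to show the idea of reading out the ionization-site atom and measuring how the prediction shifts when one atom is masked.

```python
# Toy illustration only; every name and number here is an assumption,
# not taken from the paper.

def message_passing(features, neighbors, steps=2):
    """Mix each node's feature with the mean of its neighbors' features."""
    h = dict(features)
    for _ in range(steps):
        h = {
            atom: 0.5 * h[atom]
            + 0.5 * sum(h[n] for n in neighbors[atom]) / max(len(neighbors[atom]), 1)
            for atom in h
        }
    return h


def predict_pka(features, neighbors, site):
    """Read out only the ionization-site node and map it to a pKa-like score."""
    h = message_passing(features, neighbors)
    return 30.0 - 10.0 * h[site]  # hypothetical linear head


def mask_attribution(features, neighbors, site):
    """Change in the predicted score when each non-site node is masked (set to 0)."""
    base = predict_pka(features, neighbors, site)
    scores = {}
    for atom in features:
        if atom == site:
            continue
        masked = dict(features)
        masked[atom] = 0.0
        scores[atom] = predict_pka(masked, neighbors, site) - base
    return scores


# Hypothetical molecule: a site carbon bonded to two coarse-grained substituents.
features = {"C_site": 0.0, "NO2": 1.0, "CH3": 0.1}
neighbors = {"C_site": ["NO2", "CH3"], "NO2": ["C_site"], "CH3": ["C_site"]}

scores = mask_attribution(features, neighbors, "C_site")
key_substituent = max(scores, key=scores.get)
print(scores, key_substituent)
```

In this toy setup the node with the larger assumed electron-withdrawing feature produces the larger shift when masked, which mirrors how the abstract uses masking to identify key substituents. A real implementation would mask atoms in a trained GNN rather than in a hand-written averaging rule.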
chemistry, multidisciplinary, medicinal, computer science, interdisciplinary applications, information systems