Accurate Prediction of Aqueous Free Solvation Energies Using 3D Atomic Feature-Based Graph Neural Network with Transfer Learning

Dongdong Zhang, Song Xia, Yingkai Zhang
DOI: https://doi.org/10.1021/acs.jcim.2c00260
IF: 6.162
2022-04-14
Journal of Chemical Information and Modeling
Abstract: Graph neural network (GNN)-based deep learning (DL) models have been widely used to predict experimental aqueous solvation free energies, but their prediction accuracy has reached a plateau, partly due to the scarcity of available experimental data. To tackle this challenge, we first build a large and diverse calculated data set of aqueous solvation free energies, Frag20-Aqsol-100K, with reasonable computational cost and accuracy via electronic structure calculations with continuum solvent models. Then, we develop a novel 3D atomic feature-based GNN model with principal neighborhood aggregation (PNAConv) and demonstrate that 3D atomic features obtained from molecular mechanics-optimized geometries can significantly improve the learning power of GNN models in predicting calculated solvation free energies. Finally, we employ a transfer learning strategy by pre-training our DL model on Frag20-Aqsol-100K and fine-tuning it on the small experimental data set. The fine-tuned model, A3D-PNAConv-FT, achieves state-of-the-art prediction on the FreeSolv data set, with a root-mean-squared error of 0.719 kcal/mol and a mean absolute error of 0.417 kcal/mol using random data splits. These results indicate that integrating molecular modeling and DL is a promising strategy for developing robust prediction models in molecular science.
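To make the aggregation step named in the abstract more concrete, here is a minimal sketch of principal neighborhood aggregation as it is generally described: neighbor features are combined with mean, max, min, and standard deviation aggregators, and each result is rescaled by identity, amplification, and attenuation degree scalers. This is an illustration only, not the authors' PNAConv implementation. The function name `pna_aggregate`, the plain-list data layout, and the way `delta` is supplied are assumptions made for this example.

```python
import math


def pna_aggregate(neighbor_feats, delta):
    """Sketch of PNA-style aggregation for one node (hypothetical helper).

    neighbor_feats: list of equal-length feature vectors, one per neighbor.
    delta: normalization constant; in PNA this is typically the average of
           log(degree + 1) over the training graphs (assumed to be given).
    Returns a vector of length 3 scalers * 4 aggregators * feature_dim.
    """
    d = len(neighbor_feats)
    if d == 0:
        raise ValueError("node needs at least one neighbor")
    dim = len(neighbor_feats[0])
    # Regroup so that each column holds one feature across all neighbors.
    cols = [[v[k] for v in neighbor_feats] for k in range(dim)]

    mean = [sum(c) / d for c in cols]
    mx = [max(c) for c in cols]
    mn = [min(c) for c in cols]
    # Population standard deviation, clamped at zero for numerical safety.
    std = [
        math.sqrt(max(sum(x * x for x in c) / d - m * m, 0.0))
        for c, m in zip(cols, mean)
    ]
    aggregated = mean + mx + mn + std

    # Degree scalers: identity, amplification, attenuation.
    amplification = math.log(d + 1) / delta
    attenuation = delta / math.log(d + 1)
    return [s * a for s in (1.0, amplification, attenuation) for a in aggregated]


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 6.0]]
    print(pna_aggregate(feats, delta=math.log(3)))
```

In a full GNN layer the concatenated vector would then pass through a learned transformation; that part is omitted here.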
The source code and data are accessible at: https://yzhang.hpc.nyu.edu/IMA. The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.2c00260. It contains: data distribution for Frag20-Aqsol-100K and FreeSolv with fixed data split; scatter plots for the test set of FreeSolv by A3D-PNAConv-FT; bond attributes and the corresponding encoding methods that are used to build the initial 2D bond feature vectors; message functions and updating functions in MPNN for selected GNN modules; atomic attributes and the corresponding encoding methods that are used to build the initial 2D atomic feature vectors; some key hyperparameters used for model building and training; 95% confidence intervals on the test set (10,000 samples) of Frag20-Aqsol-100K for each GNN under 2D and A3D featurization from bootstrapping (500,000 iterations); statistical analysis comparing 2D-DMPNN-TS (baseline) with other variants; and underscored bold components indicating where the variant models differ from the baseline (PDF).
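The bootstrapped 95% confidence intervals mentioned in the Supporting Information can be illustrated with a short percentile-bootstrap sketch for the RMSE. This is a generic illustration under assumptions, not the authors' procedure: the function names `rmse` and `bootstrap_rmse_ci`, the percentile method, the default iteration count, and the fixed seed are all choices made for this example.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def bootstrap_rmse_ci(y_true, y_pred, n_iter=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the RMSE (hypothetical helper).

    Resamples (target, prediction) pairs with replacement n_iter times and
    returns the alpha/2 and 1 - alpha/2 percentiles of the resampled RMSEs.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(rmse([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_iter)]
    upper = stats[min(int((1 - alpha / 2) * n_iter), n_iter - 1)]
    return lower, upper


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    targets = [-1.2, -3.4, 0.5, -6.1, -2.0, -4.7]
    preds = [-1.0, -3.9, 0.1, -5.5, -2.3, -4.6]
    print(rmse(targets, preds), bootstrap_rmse_ci(targets, preds))
```

The same resampling loop works for the mean absolute error by swapping the metric function.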
chemistry, multidisciplinary, medicinal, computer science, interdisciplinary applications, information systems
What problem does this paper attempt to address?