Analysis of Atom-level pretraining with Quantum Mechanics (QM) data for Graph Neural Networks Molecular property models

Jose Arjona-Medina, Ramil Nugmanov
2024-05-27
Abstract: Despite the rapid and significant advancements in deep learning for Quantitative Structure-Activity Relationship (QSAR) models, the challenge of learning robust molecular representations that effectively generalize in real-world scenarios to novel compounds remains an elusive and unresolved task. This study examines how atom-level pretraining with quantum mechanics (QM) data can mitigate violations of assumptions regarding the distributional similarity between training and test data and therefore improve performance and generalization in downstream tasks. On the public Therapeutics Data Commons (TDC) datasets, we show that pretraining on atom-level QM data improves overall performance and makes the distribution of feature activations more Gaussian-like, which results in a representation that is more robust to distribution shifts. To the best of our knowledge, this is the first time that hidden state molecular representations are analyzed to compare the effects of molecule-level and atom-level pretraining on QM data.
Machine Learning, Chemical Physics, Quantum Physics
What problem does this paper attempt to address?
The paper addresses a core challenge in Quantitative Structure-Activity Relationship (QSAR) modeling: how to build robust molecular representations that generalize to new compounds. Specifically, it studies pre-training on atom-level Quantum Mechanics (QM) data as a way to improve Graph Neural Networks (GNNs) for molecular property prediction. The main contributions can be summarized as follows:

1. **Performance Improvement**: In experiments on the public Therapeutics Data Commons (TDC) datasets, the authors show that pre-training on atom-level QM data significantly improves GNN performance on downstream molecular property prediction tasks, compared with both networks trained from scratch and networks pre-trained at the molecule level.

2. **Feature Distribution Analysis**: Networks pre-trained at the atom level produce first-layer feature activations whose distribution is closer to Gaussian. This suggests that pre-training yields smoother and more stable internal representations, which in turn benefits the model's learning dynamics and overall performance.

3. **Robustness to Input Distribution Changes**: Pre-trained networks are more robust to differences in input distribution between the training set and the test set. In particular, networks pre-trained at the atom level show a smaller distribution shift from training to test data than networks trained from scratch, which helps explain their better performance on the test splits.

In summary, the paper demonstrates empirically that atom-level pre-training is effective and examines how it shapes molecular representations, especially under common real-world conditions such as novel compounds or shifts in data distribution.
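The feature-distribution and robustness claims above can be made concrete with a small numerical sketch. The summary does not say which statistics the authors used, so this example assumes skewness and excess kurtosis as rough indicators of how Gaussian-like a set of activations is, and a standardized mean difference as a rough indicator of train-to-test shift. The function names are hypothetical, and the data are synthetic stand-ins rather than real GNN activations.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment; close to 0 for a symmetric (e.g. Gaussian) sample."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; close to 0 for a Gaussian sample."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 4 for v in values) / len(values) - 3.0


def standardized_mean_shift(train, test):
    """Absolute difference of means, in units of the training standard deviation."""
    gap = abs(statistics.fmean(test) - statistics.fmean(train))
    return gap / statistics.pstdev(train)


# Synthetic stand-ins for the activations of one hidden feature (assumption:
# real values would be collected from a trained network's hidden layer).
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(20000)]
heavy_tailed = [rng.expovariate(1.0) for _ in range(20000)]

# A "test set" whose activations are offset from the training activations.
shifted_test = [v + 0.5 for v in gaussian_like]

print("gaussian-like: skew=%.3f, excess kurtosis=%.3f"
      % (skewness(gaussian_like), excess_kurtosis(gaussian_like)))
print("heavy-tailed:  skew=%.3f, excess kurtosis=%.3f"
      % (skewness(heavy_tailed), excess_kurtosis(heavy_tailed)))
print("train-to-test shift: %.3f"
      % standardized_mean_shift(gaussian_like, shifted_test))
```

Under these assumptions, values of skewness and excess kurtosis near zero indicate a more Gaussian-like feature, and a smaller standardized shift indicates that training and test activations occupy a more similar range. A full analysis would compute such statistics per feature over real hidden states and would likely use a proper two-sample distance instead of a mean difference.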