Pre-training of Molecular GNNs via Conditional Boltzmann Generator

Daiki Koge, Naoaki Ono, Shigehiko Kanaya
2024-01-19
Abstract: Learning representations of molecular structures using deep learning is a fundamental problem in molecular property prediction tasks. Molecules inherently exist in the real world as three-dimensional structures; furthermore, they are not static but in continuous motion in the 3D Euclidean space, forming a potential energy surface. Therefore, it is desirable to generate multiple conformations in advance and extract molecular representations using a 4D-QSAR model that incorporates multiple conformations. However, this approach is impractical for drug and material discovery tasks because of the computational cost of obtaining multiple conformations. To address this issue, we propose a pre-training method for molecular GNNs using an existing dataset of molecular conformations to generate a latent vector universal to multiple conformations from a 2D molecular graph. Our method, called Boltzmann GNN, is formulated by maximizing the conditional marginal likelihood of a conditional generative model for conformation generation. We show that our model has a better prediction performance for molecular properties than existing pre-training methods using molecular graphs and three-dimensional molecular structures.
Machine Learning, Chemical Physics, Biomolecules
What problem does this paper attempt to address?
The paper addresses two key issues in molecular property prediction:

1. **Molecular Structure Representation Learning**: How to learn effective representations of molecular structures with deep learning, particularly Graph Neural Networks (GNNs). Accurate prediction of the biological or physicochemical properties of molecules is crucial in drug discovery and materials design.
2. **Multi-Conformation Ensemble Processing**: Molecules are not static in three-dimensional space but are in constant motion, forming a potential energy surface (PES). Predicting molecular properties more accurately therefore requires considering the ensemble of conformations a molecule can adopt rather than a single structure. However, traditional methods for generating these conformation ensembles (such as molecular dynamics simulations) are computationally very expensive, which makes them impractical for real applications.

To address these issues, the paper proposes a pre-training method called Boltzmann GNN, which uses existing molecular conformation datasets to generate, from a 2D molecular graph, a latent vector that is shared across multiple conformations. Specifically, the method combines the Geometric Diffusion Model (GeoDiff) and the Graph Transformation Network (GTN) to extract information from 2D molecular graphs that can represent multiple conformations, and it approximates the Boltzmann distribution through a conditional generative model. Experimental results show that the method outperforms existing pre-training methods on molecular property prediction tasks with small datasets.
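To make the Boltzmann distribution mentioned above concrete, the following Python sketch computes Boltzmann weights for a small set of conformer energies and uses them to form an ensemble-averaged descriptor. It only illustrates the textbook formula p_i ∝ exp(-E_i / (k_B T)); it is not the paper's model or training procedure. The function names `boltzmann_weights` and `ensemble_average`, the energies, and the descriptor values are all hypothetical and chosen for illustration.

```python
import math

# Boltzmann (gas) constant in kcal/(mol*K).
K_B_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal_per_mol, temperature_k=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies.

    Hypothetical helper for illustration; not taken from the paper.
    """
    beta = 1.0 / (K_B_KCAL_PER_MOL_K * temperature_k)
    # Subtract the minimum energy for numerical stability; the weights
    # depend only on energy differences, so the result is unchanged.
    e_min = min(energies_kcal_per_mol)
    unnormalized = [math.exp(-beta * (e - e_min)) for e in energies_kcal_per_mol]
    partition = sum(unnormalized)
    return [w / partition for w in unnormalized]


def ensemble_average(values, weights):
    """Weighted average of a per-conformer quantity over the ensemble."""
    return sum(v * w for v, w in zip(values, weights))


# Toy relative energies (kcal/mol) of three hypothetical conformers.
energies = [0.0, 0.5, 2.0]
weights = boltzmann_weights(energies)

# Toy per-conformer descriptor values, e.g. a scalar geometric feature.
descriptor = [1.2, 1.5, 2.1]
averaged = ensemble_average(descriptor, weights)

print(weights)
print(averaged)
```

Lower-energy conformers receive larger weights, so a property averaged over the ensemble is dominated by the most stable structures. Obtaining the conformers and their energies is the expensive step that the paper aims to avoid by learning a latent vector directly from the 2D graph.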