SubGDiff: A Subgraph Diffusion Model to Improve Molecular Representation Learning

Jiying Zhang, Zijing Liu, Yu Wang, Yu Li
2024-05-09
Abstract: Molecular representation learning has shown great success in advancing AI-based drug discovery. The core of many recent works is based on the fact that the 3D geometric structure of molecules provides essential information about their physical and chemical characteristics. Recently, denoising diffusion probabilistic models have achieved impressive performance in 3D molecular representation learning. However, most existing molecular diffusion models treat each atom as an independent entity, overlooking the dependency among atoms within the molecular substructures. This paper introduces a novel approach that enhances molecular representation learning by incorporating substructural information within the diffusion process. We propose a novel diffusion model termed SubGDiff for involving the molecular subgraph information in diffusion. Specifically, SubGDiff adopts three vital techniques: i) subgraph prediction, ii) expectation state, and iii) k-step same subgraph diffusion, to enhance the perception of molecular substructure in the denoising network. Experimentally, extensive downstream tasks demonstrate the superior performance of our approach. The code is available at
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses two key issues in molecular representation learning:

1. **How to enhance molecular representation learning by improving diffusion models**: Most existing molecular diffusion models treat each atom as an independent entity, ignoring the dependencies among atoms within molecular substructures. To address this limitation, the paper proposes a new diffusion model, SubGDiff, which introduces molecular subgraph information into the diffusion process.
2. **How to effectively utilize molecular substructure information**: SubGDiff relies on three techniques:
   - Subgraph Prediction: Used to select the substructures to which noise will be added.
   - Expectation State Diffusion: Intended to improve sampling.
   - K-Step Same Subgraph Diffusion: Intended to improve the model during the training phase.

Specifically, SubGDiff adds different Gaussian noise to different molecular substructures during the diffusion process and integrates a subgraph prediction loss during training, which guides the denoising network to capture molecular substructure information. The expectation state diffusion process and the K-step same subgraph diffusion process are used on top of this to further improve model performance.

The experiments report that SubGDiff outperforms prior approaches on a range of 2D and 3D molecular property prediction tasks, indicating that the method strengthens molecular representation learning. SubGDiff also performs well on molecular conformation generation, where it shows significant advantages over the baseline model GeoDiff.

In summary, the main contribution is a new diffusion model, SubGDiff, that integrates molecular substructure information into the diffusion process and thereby improves molecular representation learning.
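To make the idea of noising only a selected substructure more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `masked_forward_diffusion` and `combined_loss`, the Bernoulli mask used in place of a real subgraph sampler, the choice to redraw the mask every `k` steps, and the weighting factor `lam` are all hypothetical.

```python
import numpy as np


def masked_forward_diffusion(coords, betas, k, rng, select_prob=0.5):
    """Noise only the atoms in a sampled mask; redraw the mask every k steps.

    coords: (n_atoms, 3) array of atom positions.
    betas:  1D array of per-step noise variances in (0, 1).
    Returns the noised coordinates and the (n_steps, n_atoms) boolean masks.
    """
    x = np.array(coords, dtype=float)
    n_atoms = x.shape[0]
    mask = np.zeros(n_atoms, dtype=bool)
    masks = []
    for t, beta in enumerate(betas):
        if t % k == 0:
            # Assumption: a random Bernoulli mask stands in for a real
            # chemically meaningful subgraph sampler.
            mask = rng.random(n_atoms) < select_prob
        noise = rng.standard_normal(x.shape)
        noised = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        # Atoms outside the mask keep their current position at this step.
        x = np.where(mask[:, None], noised, x)
        masks.append(mask)
    return x, np.stack(masks)


def combined_loss(eps_pred, eps_true, mask_logits, mask, lam=1.0):
    """Noise-prediction error on masked atoms plus a mask-prediction term.

    The first term is the squared error summed over coordinates and averaged
    over the selected atoms. The second is a numerically stable binary
    cross-entropy between per-atom logits and the true mask.
    """
    sel = mask.astype(float)
    sq_err = np.sum(sel[:, None] * (eps_pred - eps_true) ** 2)
    mse = sq_err / max(sel.sum(), 1.0)
    bce = np.mean(
        np.maximum(mask_logits, 0.0)
        - mask_logits * sel
        + np.log1p(np.exp(-np.abs(mask_logits)))
    )
    return mse + lam * bce
```

In this sketch a denoising network would be trained to output both `eps_pred` and `mask_logits`, so that it has to recognize which atoms were perturbed as well as how. Holding the mask fixed for `k` consecutive steps lets noise accumulate on the same atoms, which is one plausible reading of "same subgraph" diffusion.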