Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion

Weitao Du, Jiujiu Chen, Xuecang Zhang, Zhiming Ma, Shengchao Liu
2023-12-06
Abstract: Recently, artificial intelligence for drug discovery has raised increasing interest in both machine learning and chemistry domains. The fundamental building block for drug discovery is molecule geometry and thus, the molecule's geometrical representation is the main bottleneck to better utilize machine learning techniques for drug discovery. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE can learn both the 2D bond (topology) and 3D conformation (geometry) information, and a diffusion process model is applied to mimic the augmented trajectories of such two modalities, based on which, MoleculeJAE will learn the inherent chemical structure in a self-supervised manner. Thus, the pretrained geometrical representation in MoleculeJAE is expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE proves its effectiveness by reaching state-of-the-art performance on 15 out of 20 tasks by comparing it with 12 competitive baselines.
Machine Learning, Artificial Intelligence, Biomolecules
What problem does this paper attempt to address?
This paper addresses a key issue in molecular representation learning: how to make effective use of molecular geometric information when applying machine learning to drug discovery. Specifically, it proposes MoleculeJAE (Molecule Joint Auto-Encoding), a pretraining method that jointly encodes the 2D topology (chemical bond structure) and the 3D conformation (geometric structure) of molecules, and that improves downstream task performance through self-supervised learning.

### Main Issues Addressed

1. **Bottleneck of Molecular Geometric Representation**: Molecular geometry is the fundamental building block for drug discovery, yet existing machine learning techniques struggle with complex molecular data, in particular with exploiting large-scale unlabeled datasets for molecular property prediction and molecular generation.
2. **Representation Learning of Complex Graph Structures**: Molecular data has complex graph structure and symmetries, which makes it difficult to learn powerful representations. Existing generative and contrastive pretraining methods have limitations in handling these structures.
3. **Combining 2D and 3D Information**: Previous work typically handles the 2D topology or the 3D conformation of a molecule independently and does not combine the two effectively. The paper aims to improve representation learning by jointly encoding both modalities.

### Core Contributions

- Proposes the MoleculeJAE framework, which learns the 2D topology and 3D conformation of molecules simultaneously. A diffusion process model mimics the augmented trajectories of the two modalities, and the model learns the inherent chemical structure from these trajectories in a self-supervised manner.
- Explores the connection between the forward process of a diffusion model and data augmentation, and captures the relationship between the true data distribution and its noisy counterparts by fitting their joint distribution.
- Reports experiments in which MoleculeJAE outperforms baseline methods on multiple downstream tasks, including energy prediction for stable conformations (quantum-mechanical property prediction on QM9) and force prediction along molecular dynamics trajectories (MD17). According to the abstract, it reaches state-of-the-art performance on 15 out of 20 tasks against 12 competitive baselines.

In summary, this research targets a key bottleneck in molecular representation learning and improves downstream task performance through a new self-supervised pretraining framework, MoleculeJAE.
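To make the idea of a forward diffusion process acting as data augmentation more concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a variance-preserving Gaussian schedule with illustrative constants (`beta_min`, `beta_max`), treats the 2D bond matrix as continuous values with symmetric noise, and ignores the symmetry handling and encoder that a real model would need. All function names are hypothetical.

```python
import math
import random


def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    """Fraction of signal kept at time t in [0, 1] (assumed linear-beta schedule)."""
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t * t
    return math.exp(-integral)


def noise_coords(coords, t, rng):
    """Perturb 3D coordinates: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    a = alpha_bar(t)
    scale, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [[scale * c + sigma * rng.gauss(0.0, 1.0) for c in atom]
            for atom in coords]


def noise_bonds(bonds, t, rng):
    """Perturb the 2D bond matrix with symmetric noise; the diagonal stays zero."""
    a = alpha_bar(t)
    scale, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    n = len(bonds)
    noisy = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = scale * bonds[i][j] + sigma * rng.gauss(0.0, 1.0)
            noisy[i][j] = noisy[j][i] = value
    return noisy


def joint_trajectory(coords, bonds, times, seed=0):
    """Return (t, noisy_coords, noisy_bonds) for each requested time step."""
    rng = random.Random(seed)
    return [(t, noise_coords(coords, t, rng), noise_bonds(bonds, t, rng))
            for t in times]


# Toy three-atom molecule (one central atom bonded to two others).
coords = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
bonds = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
trajectory = joint_trajectory(coords, bonds, times=[0.0, 0.25, 0.5, 1.0])
```

Each entry of `trajectory` is one increasingly noisy view of the same molecule in both modalities. In a pretraining setup along the lines the summary describes, such paired views would be the inputs from which an auto-encoder learns to recover the clean structure; the training objective itself is not shown here.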