Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks

Yanqiao Zhu, Jeehyun Hwang, Keir Adams, Zhen Liu, Bozhao Nan, Brock Stenfors, Yuanqi Du, Jatin Chauhan, Olaf Wiest, Olexandr Isayev, Connor W. Coley, Yizhou Sun, Wei Wang
2024-07-28
Abstract: Molecular Representation Learning (MRL) has proven impactful in numerous biochemical applications such as drug discovery and enzyme design. While Graph Neural Networks (GNNs) are effective at learning molecular representations from a 2D molecular graph or a single 3D structure, existing works often overlook the flexible nature of molecules, which continuously interconvert across conformations via chemical bond rotations and minor vibrational perturbations. To better account for molecular flexibility, some recent works formulate MRL as an ensemble learning problem, focusing on explicitly learning from a set of conformer structures. However, most of these studies have limited datasets, tasks, and models. In this work, we introduce the first MoleculAR Conformer Ensemble Learning (MARCEL) benchmark to thoroughly evaluate the potential of learning on conformer ensembles and suggest promising research directions. MARCEL includes four datasets covering diverse molecule- and reaction-level properties of chemically diverse molecules including organocatalysts and transition-metal catalysts, extending beyond the scope of common GNN benchmarks that are confined to drug-like molecules. In addition, we conduct a comprehensive empirical study, which benchmarks representative 1D, 2D, and 3D molecular representation learning models, along with two strategies that explicitly incorporate conformer ensembles into 3D MRL models. Our findings reveal that direct learning from an accessible conformer space can improve performance on a variety of tasks and models.
Machine Learning
What problem does this paper attempt to address?
The paper addresses a key issue in Molecular Representation Learning (MRL): how to better handle the conformational flexibility of molecules. Specifically, it targets three core problems:

1. **Molecular conformational flexibility**: Traditional methods often consider only a 2D molecular graph or a single 3D conformation, ignoring the fact that real molecules continuously interconvert across conformations through chemical bond rotations and minor vibrational perturbations. This flexibility is important for understanding molecular properties.
2. **Limitations of existing datasets and tasks**: Most existing studies on learning from conformer sets are restricted to limited datasets, limited task settings, and specific model architectures, which restricts our understanding of the potential of this approach.
3. **Limitations of 3D models**: Although 3D Graph Neural Networks (GNNs) can effectively capture the spatial structure of a molecule, they usually encode only a single conformation and do not use the information in the full set of conformations.

To address these issues, the paper introduces MARCEL (MoleculAR Conformer Ensemble Learning), a benchmark for evaluating the potential of learning molecular representations from conformer ensembles. MARCEL includes four datasets covering a chemically diverse range of molecules, from drug molecules to organocatalysts and transition-metal catalysts, and defines a series of molecule- and reaction-level tasks to test how well different models handle conformer ensembles. The paper also explores two strategies that explicitly use conformer ensembles to improve 3D models: data augmentation during training, and ensemble learning with explicit ensemble encoders.
- **Data augmentation during training**: A different conformation of each molecule is randomly selected during training, which enriches the training data and can help the model learn invariance to conformational changes.
- **Ensemble learning with explicit ensemble encoders**: An ensemble encoder encodes the entire set of conformations at once and aggregates information across all of them. The paper considers three ensemble encoders: mean pooling, DeepSets, and self-attention.

Through these strategies, the paper aims to show that learning directly from the accessible conformational space can improve performance on a variety of tasks and models, and to point the way for future research.
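The two strategies can be illustrated with a minimal, dependency-free sketch. It assumes each conformer has already been mapped to a fixed-length embedding by some 3D model; the function names (`sample_conformer`, `mean_pool`, `deepsets_pool`, `attention_pool`) are hypothetical and not taken from the MARCEL code, the DeepSets variant uses the common sum-then-transform form, and the attention variant is a parameter-free simplification with no learned projections.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def sample_conformer(conformers: List[Vector], rng: random.Random) -> Vector:
    """Augmentation: pick one conformer uniformly at random for a training step."""
    return rng.choice(conformers)


def mean_pool(embeddings: List[Vector]) -> Vector:
    """Average the conformer embeddings dimension by dimension."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def deepsets_pool(
    embeddings: List[Vector],
    phi: Callable[[Vector], Vector],
    rho: Callable[[Vector], Vector],
) -> Vector:
    """DeepSets-style pooling: rho(sum_i phi(h_i)).

    phi and rho stand in for learned networks (an assumption of this sketch).
    """
    transformed = [phi(h) for h in embeddings]
    summed = [sum(column) for column in zip(*transformed)]
    return rho(summed)


def attention_pool(embeddings: List[Vector]) -> Vector:
    """Scaled dot-product self-attention over conformers, then mean pooling.

    Queries, keys, and values are the raw embeddings (no learned weights).
    """
    dim = len(embeddings[0])
    attended = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended.append(
            [
                sum(w / total * value[j] for w, value in zip(weights, embeddings))
                for j in range(dim)
            ]
        )
    return mean_pool(attended)


if __name__ == "__main__":
    # Three made-up conformer embeddings for one molecule.
    ensemble = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.0]]
    rng = random.Random(0)
    print("augmentation sample:", sample_conformer(ensemble, rng))
    print("mean pooling:", mean_pool(ensemble))
    print("DeepSets (identity phi/rho):", deepsets_pool(ensemble, list, list))
    print("self-attention:", attention_pool(ensemble))
```

All three pooling functions are permutation-invariant over the conformers, which is the property an ensemble encoder needs because a conformer set has no natural order. In a trained model, `phi`, `rho`, and the attention projections would be learned jointly with the 3D encoder.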