GEOM: Energy-annotated molecular conformations for property prediction and molecular generation

Simon Axelrod, Rafael Gomez-Bombarelli
DOI: https://doi.org/10.48550/arXiv.2006.05531
2022-02-10
Abstract: Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
Computational Physics, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to predict molecular properties more accurately and how to generate molecular conformations in molecular design tasks. Machine learning (ML) models typically predict properties from a 2D chemical graph or a single 3D structure, but neither representation captures the ensemble of 3D conformers that a molecule can access at finite temperature. Using conformer ensembles as input could therefore improve property prediction, and the same ensembles can be used to train and evaluate generative models. To this end, the paper introduces the Geometric Ensemble Of Molecules (GEOM) dataset, which contains 37 million conformers for over 450,000 molecules. The conformers are generated with advanced sampling and semi-empirical density functional theory (DFT), and 317,000 of the species come with experimental data related to biophysics, physiology, and physical chemistry. GEOM is distinguished by its scale, the number of conformers per species, the quality of the conformers, and their connection to experimental data, which makes it well suited for training and evaluating property prediction models and generative models based on conformer ensembles.

### Main Contributions:

1. **Dataset Construction**: 37 million molecular conformers were generated, covering 133,000 species from QM9 and 317,000 species with experimental data.
2. **Conformer Set Annotation**: Ensembles of 1,511 species with BACE-1 inhibition data are labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT.
3. **Application Prospects**: GEOM can be used to train models that predict experimental properties from conformer ensembles and generative models that sample 3D conformations, with the aim of improving the accuracy and efficiency of molecular design tasks.
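
To make "the ensemble of conformers accessible at finite temperature" concrete, the following minimal sketch converts conformer energies into Boltzmann populations and uses them to average a per-conformer property. It is an illustration under stated assumptions, not the authors' pipeline: the energies and property values are made up, the function names are hypothetical, and conformer degeneracy is ignored.

```python
import math

# Molar gas constant in kcal/(mol*K), so energies are taken in kcal/mol.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal_per_mol, temperature_k=298.15):
    """Return normalized Boltzmann populations for a list of conformer energies.

    Assumption: each conformer has degeneracy 1 (no symmetry or rotamer factors).
    """
    if not energies_kcal_per_mol:
        raise ValueError("need at least one conformer energy")
    rt = R_KCAL_PER_MOL_K * temperature_k
    # Shift by the minimum energy so the largest exponent is exactly 0,
    # which avoids overflow and leaves the normalized weights unchanged.
    e_min = min(energies_kcal_per_mol)
    factors = [math.exp(-(e - e_min) / rt) for e in energies_kcal_per_mol]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Population-weighted average of a per-conformer property."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical relative energies (kcal/mol) for three conformers of one molecule.
energies = [0.0, 0.5, 2.0]
# Hypothetical per-conformer property values (arbitrary units).
properties = [1.0, 2.0, 4.0]

weights = boltzmann_weights(energies)
average = ensemble_average(properties, weights)
print("weights:", weights)
print("ensemble average:", average)
```

The lowest-energy conformer receives the largest weight, and the ensemble average always lies between the smallest and largest per-conformer value. A model that sees only one 3D structure has no access to this weighting, which is the gap that conformer-ensemble inputs are meant to close.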
### Problems Addressed:

- **Property Prediction**: Existing ML models typically rely on a 2D chemical graph or a single 3D structure, and so fail to capture molecular flexibility and conformational diversity. GEOM provides ensembles of multiple conformers per molecule, so that models can take the full ensemble as input when predicting properties.
- **Conformer Generation**: Generating high-quality 3D conformers is computationally intensive. GEOM provides a large number of high-quality conformers for training generative models, with the aim of reducing computational cost and improving generation speed and accuracy.

### Significance:

- **Benchmarking**: GEOM can serve as a benchmark for evaluating new models, particularly for predicting antiviral activity, physicochemical properties, and physiological properties.
- **Generative Models**: Generative models trained on GEOM could produce 3D conformers more quickly and reliably, which matters for fields such as drug design and materials science.

In summary, by constructing the GEOM dataset, the paper addresses limitations of existing methods in molecular property prediction and conformer generation and provides a resource for future molecular design work.