An Open Quantum Chemistry Property Database of 120 Kilo Molecules with 20 Million Conformers

Weiqi Liu, Xi Ai, Zhijian Zhou, Chao Qu, Junyi An, Zhipeng Zhou, Yuan Cheng, Yinghui Xu, Fenglei Cao, Alan Qi
2024-10-25
Abstract: Artificial intelligence is revolutionizing computational chemistry, bringing unprecedented innovation and efficiency to the field. To further advance research and expedite progress, we introduce the Quantum Open Organic Molecular (QO2Mol) database -- a large-scale quantum chemistry dataset designed for professional and transformative research in organic molecular sciences under an open-source license. The database comprises 120,000 organic molecules and approximately 20 million conformers, encompassing 10 different elements (C, H, O, N, S, P, F, Cl, Br, I), with heavy atom counts exceeding 40. Utilizing the high-precision B3LYP/def2-SVP quantum mechanical level, each conformation was meticulously computed for quantum mechanical properties, including potential energy and forces. These molecules are derived from fragments of compounds in ChEMBL, ensuring their structural relevance to real-world compounds. Its extensive coverage of molecular structures and diverse elemental composition enables comprehensive studies of structure-property relationships, enhancing the accuracy and applicability of machine learning models in predicting molecular behaviors. The QO2Mol database and benchmark codes are available at https://github.com/saiscn/QO2Mol/.
Chemical Physics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the lack of large-scale public quantum chemistry datasets, a gap that limits how far artificial intelligence and computational chemistry researchers can advance the study of small organic molecules. Specifically:

1. **Limitations of Existing Datasets**:
   - **Insufficient Element Diversity**: Existing public quantum chemistry datasets usually contain a limited variety of elements and fail to cover the range of elements found in real compounds.
   - **Insufficient Molecular Diversity**: Many datasets contain few molecules, and most of those molecules have a low number of heavy atoms, so they lack breadth and representativeness.
   - **Small Sample Size**: Existing datasets contain too few samples to support the training and validation of large-scale models.
2. **Growing Research Needs**:
   - **Drug Discovery**: Computer-aided drug design (CADD) relies on accurate modeling of small organic molecules to identify and optimize lead compounds.
   - **Materials Science**: Studying physicochemical properties requires large amounts of high-quality data to develop polymer composites and catalysts and to discover new chemical reactions.
   - **Physiology and Pathology**: Medical experts study how small organic molecules behave in the human body, including ADMET (absorption, distribution, metabolism, excretion, toxicity) and molecular metabolism.
3. **Solutions**:
   - **Construction of the QO2Mol Database**: The database contains 120,000 small organic molecules and approximately 20 million conformations, covering 10 different elements (C, H, O, N, S, P, F, Cl, Br, I), with heavy atom counts exceeding 40. The molecules are derived from fragments of compounds in ChEMBL, which keeps their structures relevant to real-world compounds.
   - **High-Precision Calculations**: The quantum mechanical properties of each conformation, including potential energy and forces, are computed at the B3LYP/def2-SVP level.
   - **Open Source**: The QO2Mol database and benchmark code are released under an open-source license for use by the global scientific community.

Through these measures, the QO2Mol database aims to accelerate progress in computational chemistry, materials science, and drug discovery by providing high-quality data that supports more accurate molecular modeling.
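The element coverage and heavy-atom counts described above are simple to check for any molecule given as a list of element symbols. The following is a minimal sketch under that assumption. The names `QO2MOL_ELEMENTS`, `heavy_atom_count`, and `unsupported_elements` are hypothetical helpers written for illustration, and are not part of the QO2Mol code or its data format.

```python
# The 10 elements listed in the abstract for QO2Mol.
QO2MOL_ELEMENTS = frozenset({"C", "H", "O", "N", "S", "P", "F", "Cl", "Br", "I"})


def heavy_atom_count(symbols):
    """Count non-hydrogen atoms in a list of element symbols."""
    return sum(1 for s in symbols if s != "H")


def unsupported_elements(symbols):
    """Return, sorted, the element symbols that fall outside the QO2Mol set."""
    return sorted(set(symbols) - QO2MOL_ELEMENTS)


# Illustrative input written out by hand: ethanol, C2H6O.
ethanol = ["C", "C", "O"] + ["H"] * 6

print(heavy_atom_count(ethanol))        # number of C and O atoms
print(unsupported_elements(ethanol))    # empty when every element is covered
```

A molecule that contains an element outside the ten listed, such as silicon, would show up in the list returned by `unsupported_elements`, which is one way to test whether a model trained on this kind of dataset is being applied outside its element coverage.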