QuanDB: a quantum chemical property database towards enhancing 3D molecular representation learning

Zhijiang Yang, Tengxin Huang, Li Pan, Jingjing Wang, Liangliang Wang, Junjie Ding, Junhua Xiao
DOI: https://doi.org/10.1186/s13321-024-00843-y
2024-05-01
Journal of Cheminformatics
Abstract: Previous studies have shown that the three-dimensional (3D) geometric and electronic structures of molecules play a crucial role in determining their key properties and intermolecular interactions. Therefore, it is necessary to establish a quantum chemical (QC) property database containing the most stable 3D geometric conformations and electronic structures of molecules. In this study, a high-quality QC property database, called QuanDB, was developed, which included structurally diverse molecular entities and featured a user-friendly interface. Currently, QuanDB contains 154,610 compounds sourced from public databases and scientific literature, with 10,125 scaffolds. The elemental composition comprises nine elements: H, C, O, N, P, S, F, Cl, and Br. For each molecule, QuanDB provides 53 global and 5 local QC properties and the most stable 3D conformation. These properties are divided into three categories: geometric structure, electronic structure, and thermodynamics. Geometric structure optimization and single point energy calculation at the theoretical level of B3LYP-D3(BJ)/6-311G(d)/SMD/water and B3LYP-D3(BJ)/def2-TZVP/SMD/water, respectively, were applied to ensure highly accurate calculations of QC properties, with the computational cost exceeding 10^7 core-hours. QuanDB provides high-value geometric and electronic structure information for use in molecular representation models, which are critical for machine-learning-based molecular design, thereby contributing to a comprehensive description of the chemical compound space. As a new high-quality dataset for QC properties, QuanDB is expected to become a benchmark tool for the training and optimization of machine learning models, thus further advancing the development of novel drugs and materials. QuanDB is freely available, without registration, at https://quandb.cmdrg.com/.
chemistry, multidisciplinary, computer science, interdisciplinary applications, information systems
What problem does this paper attempt to address?
The paper introduces QuanDB, a quantum chemical (QC) property database intended to support three-dimensional (3D) molecular representation learning. Because the 3D geometric and electronic structures of molecules are crucial to their key properties and intermolecular interactions, the authors argue that a database of the most stable 3D conformations and electronic structures is needed. QuanDB collects 154,610 structurally diverse compounds from public databases and the scientific literature, composed of nine elements (H, C, O, N, P, S, F, Cl, and Br), and presents them through a user-friendly interface. For each molecule it provides 53 global and 5 local QC properties, divided into three categories: geometric structure, electronic structure, and thermodynamics. These data are computed at the B3LYP-D3(BJ) level of theory, with geometry optimization using the 6-311G(d) basis set and single-point energies using def2-TZVP, both with the SMD solvation model for water. The database thus supplies 3D geometric and electronic structure information for molecular representation models, which is of significant value for machine-learning-driven drug and material design. Compared with other databases, QuanDB covers a wider range of chemical compound space, adopts higher-level theoretical calculations, and provides an intuitive user interface. It enriches and complements molecular structure representations, is intended to serve as a benchmark for training and optimizing machine learning models, and is expected to promote the research and development of new drugs and materials. QuanDB is a free resource that can be accessed online without registration.
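The nine-element composition stated in the abstract can be turned into a simple pre-screening check, for example to decide whether a molecule from another dataset falls inside the element space that QuanDB covers. The following is a minimal sketch under stated assumptions, not part of QuanDB or its interface: the helper names `parse_formula` and `in_quandb_element_space` are hypothetical, and the parser handles only flat molecular formulas (no parentheses, charges, or isotope labels).

```python
from __future__ import annotations

import re

# The nine elements covered by QuanDB, as listed in the abstract.
QUANDB_ELEMENTS = frozenset({"H", "C", "O", "N", "P", "S", "F", "Cl", "Br"})

# One element symbol (capital letter, optional lowercase letter)
# followed by an optional integer count, e.g. "C6", "H", "Cl2".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> dict[str, int]:
    """Parse a flat molecular formula such as 'C6H5Cl' into element counts.

    Hypothetical helper for illustration only: it does not validate that
    a symbol is a real element and does not support nested groups.
    """
    counts: dict[str, int] = {}
    pos = 0
    for match in _TOKEN.finditer(formula):
        # Reject anything the tokenizer had to skip over (spaces, brackets, ...).
        if match.start() != pos:
            raise ValueError(f"cannot parse formula: {formula!r}")
        symbol, number = match.groups()
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
        pos = match.end()
    if pos != len(formula) or not counts:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return counts


def in_quandb_element_space(formula: str) -> bool:
    """Return True if every element in the formula is one of the nine above."""
    return set(parse_formula(formula)) <= QUANDB_ELEMENTS


if __name__ == "__main__":
    for formula in ("C2H6O", "C6H5Cl", "C6H5I", "SiH4"):
        print(formula, in_quandb_element_space(formula))
```

A check like this only tests element membership. It says nothing about whether a given compound is actually present in QuanDB, which would require querying the database itself.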