Aquamarine: Quantum-Mechanical Exploration of Conformers and Solvent Effects in Large Drug-like Molecules

Leonardo Rafael Medrano Sandonas, Dries Van Rompaey, Alessio Fallani, Mathias Hilfiker, David Hahn, Laura Perez-Benito, Jonas Verhoeven, Gary Tresadern, Joerg Kurt Wegner, Hugo Ceulemans, Alexandre Tkatchenko
DOI: https://doi.org/10.26434/chemrxiv-2024-685qb
2024-02-21
Abstract: We introduce the Aquamarine (AQM) dataset, an extensive quantum-mechanical (QM) dataset that contains the structural and electronic information of 59,783 low- and high-energy conformers of 1,653 molecules with a total number of atoms ranging from 2 to 92 (mean: 50.9), and containing up to 54 (mean: 28.2) non-hydrogen atoms. To gain insights into solvent effects as well as collective dispersion interactions for drug-like molecules, we have performed QM calculations, supplemented with a treatment of many-body dispersion (MBD) interactions, of structures and properties in the gas phase and in implicit water. AQM thus contains over 40 global (molecular) and local (atom-in-a-molecule) physicochemical properties (including ground-state and response properties) per conformer. These were computed at the tightly converged PBE0+MBD level of theory for gas-phase molecules, whereas PBE0+MBD with the modified Poisson-Boltzmann (MPB) model of water was used for solvated molecules. By addressing both molecule-solvent and dispersion interactions, the AQM dataset can serve as a challenging benchmark for state-of-the-art machine learning methods for property modeling and de novo generation of large (solvated) molecules with pharmaceutical and biological relevance.
Chemistry
What problem does this paper attempt to address?
The paper addresses how to apply quantum-mechanical (QM) calculations more effectively in drug discovery to understand and predict the structures, electronic properties, and solvent effects of large drug-like molecules. Existing QM datasets mostly contain small organic molecules and do not account for molecule-solvent interactions, which limits their usefulness in drug discovery. The paper introduces the Aquamarine (AQM) dataset, a large-scale QM dataset of 59,783 low- and high-energy conformers of 1,653 molecules. The molecules contain from 2 to 92 atoms (mean: 50.9) and up to 54 non-hydrogen atoms (mean: 28.2). QM calculations were performed in the gas phase and in implicit water, including many-body dispersion (MBD) interactions and solvent effects. The AQM dataset aims to:
1. Provide extensive conformational sampling of large drug-like molecules, covering both low- and high-energy conformers, to better characterize their energy landscapes.
2. Supply over 40 global (molecular) and local (atom-in-a-molecule) physicochemical properties per conformer, including ground-state and response properties, for exploring structure-property and property-property relationships.
3. Account for molecule-solvent interactions and collective dispersion effects, which are crucial for understanding the behavior of drugs in solution.
4. Improve accuracy by optimizing representative conformers with the DFTB3+MBD method and computing properties at the PBE0+MBD level in the gas phase and, with the modified Poisson-Boltzmann (MPB) model, in implicit water.
The AQM dataset is intended to advance next-generation machine learning models for fast, accurate property prediction of drug-like molecules and for de novo generation of molecules in a chemical environment. It also offers insight into solvent effects and dispersion interactions in large drug-like molecules, in support of computer-aided drug design.
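The headline figures quoted in the abstract imply two derived statistics that help in judging the dataset's scale. The short Python sketch below computes them using only the numbers stated above. The variable names are illustrative. The hydrogen estimate assumes that both reported means are taken over the same set of structures, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract (no other data is assumed).
n_conformers = 59_783
n_molecules = 1_653
mean_atoms = 50.9        # mean total number of atoms
mean_heavy_atoms = 28.2  # mean number of non-hydrogen atoms

# Average number of sampled conformers per molecule.
conformers_per_molecule = n_conformers / n_molecules

# Mean hydrogen count, ASSUMING both means cover the same set of structures.
mean_hydrogens = mean_atoms - mean_heavy_atoms

print(f"conformers per molecule: {conformers_per_molecule:.1f}")
print(f"mean hydrogen atoms:     {mean_hydrogens:.1f}")
```

On these figures, each molecule is represented by roughly 36 conformers on average, and hydrogens make up a little under half of the atoms in a typical structure.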