MISATO - Machine Learning Dataset for Structure-Based Drug Discovery

Till Siebenmorgen, Filipe Menezes, Sabrina Benassou, Erinç Merdivan, Stefan Kesselheim, Marie Piraud, Fabian J. Theis, Michael Sattler, Grzegorz M. Popowicz
DOI: https://doi.org/10.5281/zenodo.7711952
2023-01-01
Abstract: Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of almost 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine the protonation states of proteins and small-molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.