Deep Learning-Ready Voxel Representation of Protein-Ligand Complexes from an Enhanced PDBbind v.2020 Dataset

Isabella Alvim Guedes, Matheus Müller Pereira da Silva, Fábio Lima Custódio, Laurent Emmanuel Dardenne
DOI: https://doi.org/10.26434/chemrxiv-2023-f4q6k
2023-12-11
Abstract: A critical aspect of successful deep learning (DL) modelling in computer-aided drug discovery (CADD) is the representation of biomolecular data. Voxel grid representations have emerged as a straightforward method for depicting 3D molecular structures of protein-ligand complexes. Proper structural preparation of these complexes is also crucial, particularly in models where the orientation of hydrogen atoms and the accurate assignment of protonation/tautomeric states are vital. The PDBbind, a widely used dataset, can be improved in this regard. This work presents an enhanced version of the PDBbind v.2020 refined set concerning structural preparation, a voxel representation of these structures suitable for DL model training and a diverse set of docking-generated poses that could be used to develop new scoring functions for pose prediction. We also introduce DockTGrid, a software library developed to generate these voxel representations, which can be adapted to create new molecular features. With this work, we aim to provide the CADD community with high-quality, accessible resources to facilitate the development of DL models for drug discovery.
Chemistry
What problem does this paper attempt to address?
This paper addresses a key issue in computer-aided drug discovery (CADD): how to represent biomolecular data for deep learning models. The authors present a voxel (volumetric grid) representation of protein-ligand complexes suited to model training, together with an enhanced version of the PDBbind v.2020 refined set that includes improved structural preparation, the voxelized structures, and multiple binding poses generated by docking. The dataset is intended to support the development of new scoring functions and deep learning models for drug discovery. PDBbind is widely used for developing and benchmarking molecular modeling tools, but the protonation and tautomeric states assigned to its protein-ligand complexes can be improved, and incorrect states may lead to incorrect descriptions of protein-ligand interactions. The authors therefore prepared the structures with the Protein Preparation Wizard in the Maestro software, assigning protonation states and optimizing hydrogen bond networks. They also developed DockTGrid, an open-source Python library that generates the voxelized datasets, lets users define custom molecular features, and uses GPU acceleration to speed up processing. The dataset is divided into training, validation, and test sets in a way intended to avoid data leakage, and the multiple binding poses were generated through re-docking experiments with the DockThor program. Overall, the paper aims to give drug discovery and structural biology researchers high-quality, accessible resources that facilitate the application of deep learning models.
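To make the idea of a voxel representation concrete, the following is a minimal NumPy sketch of how atom coordinates can be binned into a multi-channel 3D grid. It is not DockTGrid's API or the authors' implementation: the function name `voxelize`, its parameters, the atom-count occupancy scheme, and the default box size and resolution are all assumptions chosen for illustration.

```python
import numpy as np


def voxelize(coords, channels, n_channels, center, box_size=24.0, resolution=1.0):
    """Count atoms per voxel, producing one 3D grid per feature channel.

    Hypothetical helper for illustration only; all names and defaults are assumptions.
    coords:   (N, 3) atom coordinates in angstroms
    channels: (N,) integer channel index for each atom (e.g. an atom-type class)
    center:   (3,) center of the cubic box, e.g. the ligand centroid
    """
    n = int(round(box_size / resolution))
    grid = np.zeros((n_channels, n, n, n), dtype=np.float32)

    coords = np.asarray(coords, dtype=np.float64)
    channels = np.asarray(channels, dtype=np.intp)
    center = np.asarray(center, dtype=np.float64)

    # Shift so the lower corner of the box is the origin, then convert to voxel indices.
    origin = center - box_size / 2.0
    idx = np.floor((coords - origin) / resolution).astype(np.intp)

    # Discard atoms that fall outside the box.
    inside = np.all((idx >= 0) & (idx < n), axis=1)
    idx, channels = idx[inside], channels[inside]

    # np.add.at accumulates correctly when several atoms share the same voxel.
    np.add.at(grid, (channels, idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid


# Toy example: four atoms, two channels, a 4 angstrom box with 1 angstrom voxels.
coords = [[0.2, 0.2, 0.2], [0.4, 0.3, 0.1], [-1.5, 0.0, 0.0], [100.0, 0.0, 0.0]]
channels = [0, 0, 1, 1]
grid = voxelize(coords, channels, n_channels=2, center=[0.0, 0.0, 0.0],
                box_size=4.0, resolution=1.0)
print(grid.shape)  # (2, 4, 4, 4)
```

In this toy example the first two atoms land in the same voxel of channel 0, the third occupies a voxel in channel 1, and the fourth lies outside the box and is dropped. Real pipelines typically replace the plain atom count with smoother, distance-dependent occupancies and use channels that encode chemical features, which is the kind of customization the summary attributes to DockTGrid.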