Deep Learning-Ready Voxel Representation of Protein-Ligand Complexes from an Enhanced PDBbind v.2020 Dataset

Isabella Alvim Guedes, Matheus Müller Pereira da Silva, Fábio Lima Custódio, Laurent Emmanuel Dardenne

DOI: https://doi.org/10.26434/chemrxiv-2023-f4q6k

2023-12-11

Abstract: A critical aspect of successful deep learning (DL) modelling in computer-aided drug discovery (CADD) is the representation of biomolecular data. Voxel grid representations have emerged as a straightforward method for depicting 3D molecular structures of protein-ligand complexes. Proper structural preparation of these complexes is also crucial, particularly in models where the orientation of hydrogen atoms and the accurate assignment of protonation/tautomeric states are vital. The PDBbind, a widely used dataset, can be improved in this regard. This work presents an enhanced version of the PDBbind v.2020 refined set concerning structural preparation, a voxel representation of these structures suitable for DL model training and a diverse set of docking-generated poses that could be used to develop new scoring functions for pose prediction. We also introduce DockTGrid, a software library developed to generate these voxel representations, which can be adapted to create new molecular features. With this work, we aim to provide the CADD community with high-quality, accessible resources to facilitate the development of DL models for drug discovery.

Chemistry

What problem does this paper attempt to address?

This paper addresses a key issue in computer-aided drug discovery (CADD): how to represent biomolecular data for deep learning modeling. The authors present a voxel (volumetric grid) representation of protein-ligand complexes suitable for training deep learning models, together with an enhanced version of the PDBbind v.2020 dataset that includes improved structural preparation, the voxelized structures, and multiple docking-generated binding poses. The dataset is intended to support the development of new scoring functions and deep learning models for drug discovery. Although PDBbind is widely used for developing and benchmarking molecular modeling tools, the assignment of protonation and tautomeric states in its protein-ligand complexes can be improved, and incorrect states can lead to incorrect descriptions of protein-ligand interactions. The authors therefore re-prepared the structures, assigning protonation states with the Protein Preparation Wizard in the Maestro software and optimizing hydrogen-bond networks. They also developed DockTGrid, an open-source Python library that generates voxelized datasets, lets users define custom molecular features, and uses GPU acceleration to speed up processing. The dataset is divided into training, validation, and test sets so as to avoid data leakage, and the multiple binding poses are generated through re-docking experiments with the DockThor program. Overall, the paper aims to provide high-quality, accessible resources for drug discovery and structural biology researchers and to facilitate the application of deep learning models in drug discovery.
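
To make the idea of a voxel representation concrete, the following is a minimal pure-Python sketch of how atoms of a protein-ligand complex might be binned into a cubic grid with one channel per molecule type. It is not DockTGrid's API or the authors' method: the function name `voxelize`, the parameters `n_voxels` and `resolution`, the two channel names, and the toy coordinates are all assumptions made for illustration, and real pipelines typically use smoother atom densities and richer feature channels.

```python
import math

def voxelize(atoms, center, n_voxels=8, resolution=1.0,
             channels=("protein", "ligand")):
    """Return grid[channel][i][j][k] holding atom counts in a cubic box.

    atoms: iterable of (x, y, z, channel) tuples, coordinates in angstroms.
    center: (x, y, z) of the box center, e.g. the ligand centroid.
    (Hypothetical helper for illustration only.)
    """
    half = n_voxels * resolution / 2.0
    origin = [c - half for c in center]  # lower corner of the box
    grid = {
        ch: [[[0] * n_voxels for _ in range(n_voxels)]
             for _ in range(n_voxels)]
        for ch in channels
    }
    for x, y, z, ch in atoms:
        # Map each coordinate to the index of the voxel that contains it.
        idx = [math.floor((p - o) / resolution)
               for p, o in zip((x, y, z), origin)]
        if all(0 <= i < n_voxels for i in idx):  # skip atoms outside the box
            i, j, k = idx
            grid[ch][i][j][k] += 1
    return grid

# Toy complex: two protein atoms, one ligand atom, one atom outside the box.
atoms = [
    (0.2, 0.2, 0.2, "protein"),
    (-1.5, 0.0, 0.0, "protein"),
    (0.5, 0.5, 0.5, "ligand"),
    (50.0, 0.0, 0.0, "protein"),
]
grid = voxelize(atoms, center=(0.0, 0.0, 0.0))
```

Each channel of the resulting grid can then be stacked into a 4D array (channels × depth × height × width), the usual input shape for a 3D convolutional network.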

Deep Learning Strategies for Enhanced Molecular Docking and Virtual Screening

Deep Learning Model for Efficient Protein–Ligand Docking with Implicit Side-Chain Flexibility

Deep Learning for Protein-Ligand Docking: Are We There Yet?

Binding Affinity Prediction with 3D Machine Learning: Training Data and Challenging External Testing

Protein docking model evaluation by 3D deep convolutional neural networks

Exploring protein–ligand binding affinity prediction with electron density-based geometric deep learning

Enhancing Ligand Pose Sampling for Molecular Docking

DeepDock: Enhancing Ligand-protein Interaction Prediction by a Combination of Ligand and Structure Information

Structure-based drug design by denoising voxel grids

DeepBindGCN: Integrating Molecular Vector Representation with Graph Convolutional Neural Networks for Protein–Ligand Interaction Prediction

Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models

BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening

DEELIG: A Deep Learning Approach to Predict Protein-Ligand Binding Affinity

Combining Docking Pose Rank and Structure with Deep Learning Improves Protein–Ligand Binding Mode Prediction over a Baseline Docking Approach

Boosting Docking-Based Virtual Screening with Deep Learning

A new paradigm for applying deep learning to protein–ligand interaction prediction

DeltaDock: A Unified Framework for Accurate, Efficient, and Physically Reliable Molecular Docking

Binding-Adaptive Diffusion Models for Structure-Based Drug Design

Addressing docking pose selection with structure-based deep learning: Recent advances, challenges and opportunities

CarsiDock: a deep learning paradigm for accurate protein-ligand docking and screening based on large-scale pre-training