Abstract: Identifying novel drug-target interactions is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, here we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We unveil the mechanisms responsible for this shortcoming, demonstrating how models rely on shortcuts that leverage the topology of the protein-ligand bipartite network, rather than learning the node features. Here we introduce AI-Bind, a pipeline that combines network-based sampling strategies with unsupervised pre-training to improve binding predictions for novel proteins and ligands. We validate AI-Bind predictions via docking simulations and comparison with recent experimental evidence, and step up the process of interpreting machine learning prediction of protein-ligand binding by identifying potential active binding sites on the amino acid sequence. AI-Bind is a high-throughput approach to identify drug-target combinations with the potential of becoming a powerful tool in drug discovery.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the poor generalization of protein-ligand binding prediction models. Specifically, existing deep learning models perform poorly when predicting binding for new, previously unseen proteins and ligands. These models tend to rely on the topology of the protein-ligand interaction network rather than learning molecular structural features, which limits their effectiveness on new data.

### Background and Motivation

1. **Key Steps in Drug Discovery**:
   - Identifying new drug-target interactions is a critical and rate-limiting step in the drug discovery process.
   - Deep learning models have been proposed to accelerate this process, but existing models fail to generalize to new structures.
2. **Limitations of Existing Models**:
   - Existing models rely on the topology of the protein-ligand interaction network rather than learning node features (such as chemical structures).
   - This reliance leads to poor performance on unseen proteins and ligands.
   - Annotation imbalance (some proteins and ligands have many positive annotations while others have few) further exacerbates this issue.

### Solution

1. **Introducing AI-Bind**:
   - AI-Bind is a pipeline that combines network science methods with unsupervised pre-training to improve binding predictions for new proteins and ligands.
   - By using network-derived negative samples and unsupervised pre-training, AI-Bind limits both overfitting to existing libraries and the effect of annotation imbalance.
2. **Specific Methods**:
   - Use shortest path distance to identify protein-ligand pairs that are far apart in the network and treat them as negative samples.
   - Combine these with experimentally validated non-binding protein-ligand pairs so that each node has sufficient positive and negative samples in the training data.
   - Learn node feature representations in an unsupervised manner, covering the chemical structures of ligands and the amino acid sequences of proteins.
3. **Validation and Results**:
   - Validate AI-Bind's predictions through docking simulations and comparison with recent experimental results.
   - Tests on COVID-19-related proteins are reported to support the reliability of AI-Bind's predictions.

### Summary

This paper addresses the failure of existing deep learning models to generalize to new proteins and ligands by introducing AI-Bind. By combining network science methods with unsupervised pre-training, AI-Bind improves prediction for new data, offering a potentially powerful tool for drug discovery.
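
The shortest-path sampling step described under "Specific Methods" can be illustrated with a minimal sketch. This is not the AI-Bind implementation: the function names, the toy network, and the `min_distance` threshold are all assumptions made for illustration, and the distance cutoff used in the paper is not stated in this summary. The idea is to build the bipartite network from known binding pairs, run a breadth-first search from each protein, and keep ligands that are at least `min_distance` hops away as candidate negatives.

```python
from collections import defaultdict, deque


def build_bipartite_graph(positive_pairs):
    """Adjacency map of the protein-ligand network; nodes are (kind, name) tuples."""
    adj = defaultdict(set)
    for protein, ligand in positive_pairs:
        p, l = ("protein", protein), ("ligand", ligand)
        adj[p].add(l)
        adj[l].add(p)
    return adj


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def distant_pairs_as_negatives(positive_pairs, min_distance):
    """Return (protein, ligand, distance) triples with distance >= min_distance.

    min_distance is a hypothetical, illustrative parameter. Pairs with no
    connecting path are not returned by this sketch.
    """
    adj = build_bipartite_graph(positive_pairs)
    negatives = []
    for node in sorted(adj):
        if node[0] != "protein":
            continue
        dist = shortest_path_lengths(adj, node)
        for other, d in sorted(dist.items()):
            if other[0] == "ligand" and d >= min_distance:
                negatives.append((node[1], other[1], d))
    return negatives


# Toy chain network (made-up names): P1-L1-P2-L2-P3-L3
pairs = [("P1", "L1"), ("P2", "L1"), ("P2", "L2"), ("P3", "L2"), ("P3", "L3")]
negatives = distant_pairs_as_negatives(pairs, min_distance=3)
```

In a bipartite network, a protein and a ligand are always an odd number of hops apart, and a distance of 1 is a known binding pair, so the smallest useful threshold is 3. A larger threshold yields fewer but more distant candidates. In the pipeline described above, such network-derived negatives are combined with experimentally validated non-binding pairs rather than used alone.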

Improving the generalizability of protein-ligand binding predictions with AI-Bind

AI-Bind: Improving Binding Predictions for Novel Protein Targets and Ligands

On Machine Learning Approaches for Protein-Ligand Binding Affinity Prediction

Binding Affinity Prediction with 3D Machine Learning: Training Data and Challenging External Testing

Learning Binding Affinities via Fine-tuning of Protein and Ligand Language Models

Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches

BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening

DynamicBind: predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model

[Advances in using artificial intelligence for predicting protein-ligand binding affinity]

A new paradigm for applying deep learning to protein–ligand interaction prediction

Synergistic Application of Molecular Docking and Machine Learning for Improved Binding Pose

ProBID-Net: A Deep Learning Model for Protein-Protein Binding Interface Design

Enhancing Drug-Target Binding Affinity Prediction through Deep Learning and Protein Secondary Structure Integration

DeepREAL: A Deep Learning Powered Multi-scale Modeling Framework Towards Predicting Out-of-distribution Receptor Activity of Ligand Binding

Artificial intelligence in the prediction of protein–ligand interactions: recent advances and future directions

Development and evaluation of a deep learning model for protein-ligand binding affinity prediction

DEELIG: A Deep Learning Approach to Predict Protein-Ligand Binding Affinity

Improved Drug-target Interaction Prediction with Intermolecular Graph Transformer

GAABind: a Geometry-Aware Attention-Based Network for Accurate Protein-Ligand Binding Pose and Binding Affinity Prediction

ZeroBind: a protein-specific zero-shot predictor with subgraph matching for drug-target interactions

Improved prediction of ligand-protein binding affinities by meta-modeling