Improving the generalizability of protein-ligand binding predictions with AI-Bind

Ayan Chatterjee, Robin Walters, Zohair Shafi, Omair Shafi Ahmed, Michael Sebek, Deisy Gysi, Rose Yu, Tina Eliassi-Rad, Albert-László Barabási, Giulia Menichetti
DOI: https://doi.org/10.1038/s41467-023-37572-z
IF: 16.6
2023-04-08
Nature Communications
Abstract: Identifying novel drug-target interactions is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, here we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We unveil the mechanisms responsible for this shortcoming, demonstrating how models rely on shortcuts that leverage the topology of the protein-ligand bipartite network, rather than learning the node features. Here we introduce AI-Bind, a pipeline that combines network-based sampling strategies with unsupervised pre-training to improve binding predictions for novel proteins and ligands. We validate AI-Bind predictions via docking simulations and comparison with recent experimental evidence, and step up the process of interpreting machine learning prediction of protein-ligand binding by identifying potential active binding sites on the amino acid sequence. AI-Bind is a high-throughput approach to identify drug-target combinations with the potential of becoming a powerful tool in drug discovery.
multidisciplinary sciences
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the issue of insufficient generalization ability in protein-ligand binding prediction. Specifically, existing deep learning models perform poorly when predicting the binding between new, unseen proteins and ligands. These models tend to rely on the topological structure of protein-ligand interaction networks rather than learning molecular structural features, which limits their effectiveness when dealing with new data.

### Background and Motivation

1. **Key Steps in Drug Discovery**:
   - Identifying new drug-target interactions is a critical and rate-limiting step in the drug discovery process.
   - Deep learning models have been proposed to accelerate this process, but existing models fail to generalize to new structures.
2. **Limitations of Existing Models**:
   - Existing models rely on the topological structure of protein-ligand interaction networks rather than learning node features (such as chemical structures).
   - This reliance leads to poor performance when dealing with unseen proteins and ligands.
   - Data annotation imbalance (some proteins and ligands have more positive annotations while others have fewer) further exacerbates this issue.

### Solution

1. **Introducing AI-Bind**:
   - AI-Bind is a pipeline that combines network science methods and unsupervised pre-training to improve the prediction of binding between new proteins and ligands.
   - By using network-derived negative samples and unsupervised pre-training, AI-Bind can control overfitting to existing libraries and the issue of annotation imbalance.
2. **Specific Methods**:
   - Use shortest path distance to identify distant protein-ligand pairs in the network as negative samples.
   - Combine experimentally validated non-binding protein-ligand pairs to ensure each node has sufficient positive and negative samples in the training data.
   - Unsupervised learning of node feature representations, including the chemical structure of ligands and the amino acid sequences of proteins.
3. **Validation and Results**:
   - Validate AI-Bind's predictions through docking simulations and comparison with the latest experimental results.
   - Tests on COVID-19 related proteins show that AI-Bind's predictions are highly reliable and accurate.

### Summary

This paper addresses the shortcomings of existing deep learning models in generalizing to new proteins and ligands by introducing AI-Bind. By combining network science methods and unsupervised pre-training, AI-Bind improves the prediction capability for new data, providing a powerful tool for drug discovery.
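
### Illustrative Sketch: Network-Derived Negative Samples

The shortest-path negative sampling listed under Specific Methods can be illustrated with a small, self-contained sketch. This is not the AI-Bind implementation: the function names, the toy protein and ligand identifiers, and the `min_distance` threshold are all hypothetical and chosen only for illustration. The idea shown is to build the bipartite graph from known binding pairs, run a breadth-first search from each protein, and keep protein-ligand pairs that are many hops apart as candidate negatives. Pairs with no connecting path are skipped here, which is a simplifying assumption.

```python
from collections import deque


def build_adjacency(positive_pairs):
    """Build an undirected bipartite graph from (protein, ligand) binding pairs.

    Nodes are tagged ("P", name) or ("L", name) so that a protein and a
    ligand sharing the same identifier cannot collide.
    """
    adjacency = {}
    for protein, ligand in positive_pairs:
        p_node, l_node = ("P", protein), ("L", ligand)
        adjacency.setdefault(p_node, set()).add(l_node)
        adjacency.setdefault(l_node, set()).add(p_node)
    return adjacency


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def network_negatives(positive_pairs, min_distance):
    """Return (protein, ligand, distance) for pairs at least min_distance hops apart.

    min_distance is a hypothetical tuning parameter; unreachable pairs are
    skipped (an assumption of this sketch).
    """
    adjacency = build_adjacency(positive_pairs)
    proteins = sorted({p for p, _ in positive_pairs})
    ligands = sorted({l for _, l in positive_pairs})
    negatives = []
    for protein in proteins:
        dist = hop_distances(adjacency, ("P", protein))
        for ligand in ligands:
            d = dist.get(("L", ligand))
            if d is not None and d >= min_distance:
                negatives.append((protein, ligand, d))
    return negatives


# Toy chain of known bindings: P1 - L1 - P2 - L2 - P3 - L3
toy_positives = [
    ("P1", "L1"), ("P2", "L1"),
    ("P2", "L2"), ("P3", "L2"),
    ("P3", "L3"),
]
toy_negatives = network_negatives(toy_positives, min_distance=5)
print(toy_negatives)
```

In this toy chain only the pair (P1, L3) is five hops apart, so it is the only candidate negative at that threshold. In a bipartite graph every protein-ligand distance is odd, so thresholds are naturally chosen among odd values; lowering the threshold yields more negatives at the cost of selecting pairs that are closer in the network.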