Hybrid protein-ligand binding residue prediction with protein language models: Does the structure matter?

Hamza Gamouh, David Hoksza, Marian Novotny
DOI: https://doi.org/10.1101/2023.08.11.553028
2024-07-10
Abstract: Predicting protein-ligand binding sites is crucial in studying protein interactions, with applications in biotechnology and drug discovery. Two distinct paradigms have emerged for this purpose: sequence-based methods, which leverage protein sequence information, and structure-based methods, which rely on the three-dimensional (3D) structure of the protein. We propose to study a hybrid approach combining both paradigms' strengths by integrating two recent deep learning architectures: protein language models (pLMs) from the sequence-based paradigm and Graph Neural Networks (GNNs) from the structure-based paradigm. Specifically, we construct a residue-level Graph Attention Network (GAT) model based on the protein's 3D structure that uses pre-trained pLM embeddings as node features. This integration enables us to study how the interplay between the sequential information encoded in the protein sequence and the spatial relationships within the protein structure affects the model's performance. Using a benchmark dataset covering a range of ligands and ligand types, we show that using the structure information consistently enhances the predictive power of baselines in absolute terms. Nevertheless, as more complex pLMs are employed to represent node features, the relative impact of the structure information represented by the GNN architecture diminishes. These observations suggest that, although using the experimental protein structure almost always improves the accuracy of binding site prediction, complex pLMs still contain structural information that leads to good predictive performance even without using the 3D structure.
Bioinformatics
What problem does this paper attempt to address?
This paper examines how to combine protein sequence information and structural information to predict protein-ligand binding sites, the residues where a protein binds a ligand. The authors propose a hybrid approach that uses protein language models (pLMs) to capture information from the sequence and graph neural networks (GNNs) to capture information from the structure. They construct a residue-level graph attention network (GAT) model based on the protein's 3D structure, with pre-trained pLM embeddings as node features. The experiments show that structural information consistently improves prediction accuracy in absolute terms, but that its relative contribution shrinks as more complex pLMs are used: such pLMs already encode enough structural information to achieve good prediction performance without the 3D structure. The main research question of the paper is how much the 3D structure adds when protein sequence and structure data are integrated for protein-ligand binding site prediction.
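The hybrid architecture described above can be illustrated with a small sketch in plain Python. It is not the authors' implementation: the 8 Å C-alpha distance cutoff, the single attention head, the logistic read-out, and all function names (`build_residue_graph`, `gat_layer`, `binding_probabilities`) are assumptions made for illustration, and random vectors stand in for real pLM embeddings and trained weights.

```python
import math
import random


def build_residue_graph(ca_coords, cutoff=8.0):
    """Link residues whose C-alpha atoms lie within `cutoff` angstroms.

    Returns one neighbor list per residue; each list starts with the residue
    itself (a self-loop), so every node attends to at least one node.
    The cutoff value is an assumption, not taken from the paper.
    """
    n = len(ca_coords)
    neighbors = [[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, neighbors, W, a_src, a_dst):
    """Single-head graph attention: softmax-weighted sum of projected neighbors."""
    h = [[dot(row, x) for row in W] for x in features]  # h_i = W x_i
    src = [dot(a_src, hi) for hi in h]  # attention term of the receiving node
    dst = [dot(a_dst, hi) for hi in h]  # attention term of the neighbor
    out = []
    for i, nbrs in enumerate(neighbors):
        scores = [leaky_relu(src[i] + dst[j]) for j in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([
            sum(e / total * h[j][k] for e, j in zip(exps, nbrs))
            for k in range(len(W))
        ])
    return out


def binding_probabilities(hidden, w_out, bias=0.0):
    """Per-residue logistic read-out: probability that a residue binds a ligand."""
    return [1.0 / (1.0 + math.exp(-(dot(w_out, h) + bias))) for h in hidden]


# Toy input: four residues on a line; the last one is far from the others.
rng = random.Random(0)
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
embed_dim, hidden_dim = 6, 3  # real pLM embeddings are far larger
# Random placeholders for pLM embeddings and for weights that would be learned.
embeddings = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in ca_coords]
W = [[rng.gauss(0, 0.5) for _ in range(embed_dim)] for _ in range(hidden_dim)]
a_src = [rng.gauss(0, 0.5) for _ in range(hidden_dim)]
a_dst = [rng.gauss(0, 0.5) for _ in range(hidden_dim)]
w_out = [rng.gauss(0, 0.5) for _ in range(hidden_dim)]

neighbors = build_residue_graph(ca_coords)
hidden = gat_layer(embeddings, neighbors, W, a_src, a_dst)
probs = binding_probabilities(hidden, w_out)
```

If every neighbor list contains only the self-loop, the attention weights collapse to 1 and the layer reduces to a per-residue linear map of the embedding. Under the assumptions of this sketch, that gives a simple structure-free baseline against which the contribution of the 3D graph can be compared.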