Sequence-based Protein-Protein Interaction Prediction Using Multi-kernel Deep Convolutional Neural Networks with Protein Language Model

Thanh Hai Dang, Tien Anh Vu
DOI: https://doi.org/10.1101/2023.10.03.560728
2024-03-10
Abstract: Predicting protein-protein interactions (PPIs) using only sequence information represents a fundamental problem in biology. In the past five years, a wide range of state-of-the-art deep learning models have been developed to address the computational prediction of PPIs based on sequences. Convolutional neural networks (CNNs) are widely adopted in these model architectures; however, the design of a deep and wide CNN architecture that comprehensively extracts interaction features from pairs of proteins is not well studied. Despite the development of several protein language models that distill the knowledge of evolutionary, structural, and functional information from gigantic protein sequence databases, no studies have integrated the amino acid embeddings of a protein language model for encoding protein sequences. In this study, we introduce a novel hybrid classifier, xCAPT5, which combines the deep multi-kernel convolutional accumulated pooling Siamese neural network (CAPT5) and the XGBoost model (x) to enhance interaction prediction. CAPT5 uses multiple deep convolutional channels with varying kernel sizes in the Siamese architecture, enabling the capture of small- and large-scale local features. By concatenating max and average pooling features in a depth-wise manner, CAPT5 effectively learns crucial features at low computational cost. This study is the first to extract information-rich amino acid embeddings from a protein language model with a deep convolutional network, which is trained to obtain discriminative representations of protein sequence pairs that are fed into XGBoost for predicting PPIs. Experimental results demonstrate that xCAPT5 outperforms several state-of-the-art methods on binary PPI prediction, including generalized PPI on intra-species, cross-species, inter-species, and stringent similarity tasks. The implementation of our framework is available at
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the problem of predicting protein-protein interactions (PPIs) from sequence information alone. Many deep learning models have been developed for sequence-based PPI prediction in recent years, but the design of a deep and wide convolutional neural network architecture that comprehensively extracts interaction features from protein pairs has not been well studied. The paper proposes a hybrid classifier called xCAPT5, which combines CAPT5 (a deep multi-kernel convolutional accumulated pooling Siamese neural network) with an XGBoost model to improve interaction prediction. CAPT5 applies convolutional channels with different kernel sizes inside a Siamese architecture to capture small- and large-scale local features, and concatenates max and average pooling features depth-wise. xCAPT5 encodes protein sequences with amino acid embeddings taken from a pre-trained protein language model; the convolutional network is trained on these embeddings to obtain discriminative representations of protein sequence pairs, which are then fed into XGBoost for PPI prediction. Experimental results show that xCAPT5 outperforms several state-of-the-art methods on binary PPI prediction, including intra-species, cross-species, inter-species, and stringent similarity tasks.
The paper also describes the five-stage architecture of xCAPT5: an amino acid encoding layer, a protein sequence learning layer, a protein pair learning layer, an intermediate layer, and a prediction layer. The pre-trained protein language model ProtT5-XL-UniRef50 supplies embeddings that capture the evolutionary, physicochemical, and structural information of amino acids, and features are extracted with deep multi-kernel convolutional neural networks combined with global average pooling and global max pooling. In summary, the paper aims to use sequence information more effectively to predict interactions between proteins, and reports improved prediction accuracy with this architecture.
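The multi-kernel Siamese encoding with concatenated max and average pooling can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy embedding dimension, the kernel sizes (2, 3, 5), the single filter per kernel size, the random weights, and the element-wise product used to merge the two protein vectors are all assumptions made for illustration. In the described framework, the per-residue vectors would come from a protein language model and the resulting pair representation would be passed to a gradient-boosted classifier, neither of which is shown here.

```python
import random


def conv1d_relu(seq, kernel):
    """Slide one kernel (k x dim weights) along a sequence of per-residue
    embedding vectors and apply ReLU. Returns len(seq) - k + 1 activations."""
    k, dim = len(kernel), len(kernel[0])
    out = []
    for i in range(len(seq) - k + 1):
        s = sum(seq[i + j][d] * kernel[j][d] for j in range(k) for d in range(dim))
        out.append(max(0.0, s))
    return out


def encode(seq, kernels):
    """Shared encoder: one convolutional channel per kernel size.
    For each channel, the global max and global average of the feature map
    are concatenated into the output vector."""
    features = []
    for kernel in kernels:
        fmap = conv1d_relu(seq, kernel)
        if not fmap:
            raise ValueError("sequence is shorter than the kernel")
        features += [max(fmap), sum(fmap) / len(fmap)]
    return features


def pair_features(seq_a, seq_b, kernels):
    """Siamese step: both proteins pass through the same encoder weights.
    The element-wise product is an assumed merge rule; it makes the
    result independent of the order of the two proteins."""
    fa, fb = encode(seq_a, kernels), encode(seq_b, kernels)
    return [a * b for a, b in zip(fa, fb)]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or language-model embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
EMBED_DIM = 4              # toy size, chosen only to keep the example small
KERNEL_SIZES = (2, 3, 5)   # hypothetical small- to large-scale windows
kernels = [random_matrix(k, EMBED_DIM, rng) for k in KERNEL_SIZES]

protein_a = random_matrix(12, EMBED_DIM, rng)  # 12 residues
protein_b = random_matrix(9, EMBED_DIM, rng)   # 9 residues
pair_vector = pair_features(protein_a, protein_b, kernels)
print(len(pair_vector))  # two pooled features (max, average) per kernel size
```

Because the pooling is global, proteins of different lengths map to vectors of the same size, which is what allows a fixed-width pair representation to be handed to a downstream classifier.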