Integration of protein sequence and protein–protein interaction data by hypergraph learning to identify novel protein complexes

Simin Xia, Dianke Li, Xinru Deng, Zhongyang Liu, Huaqing Zhu, Yuan Liu, Dong Li
DOI: https://doi.org/10.1093/bib/bbae274
IF: 9.5
2024-06-10
Briefings in Bioinformatics
Abstract: Protein–protein interactions (PPIs) are the basis of many important biological processes, with protein complexes being the key forms implementing these interactions. Understanding protein complexes and their functions is critical for elucidating mechanisms of life processes, disease diagnosis and treatment and drug development. However, experimental methods for identifying protein complexes have many limitations. Therefore, it is necessary to use computational methods to predict protein complexes. Protein sequences can indicate the structure and biological functions of proteins, while also determining their binding abilities with other proteins, influencing the formation of protein complexes. Integrating these characteristics to predict protein complexes is very promising, but currently there is no effective framework that can utilize both protein sequence and PPI network topology for complex prediction. To address this challenge, we have developed HyperGraphComplex, a method based on a hypergraph variational autoencoder that can capture expressive features from protein sequences without feature engineering, while also considering topological properties in PPI networks, to predict protein complexes. Experimental results demonstrated that HyperGraphComplex achieves satisfactory predictive performance when compared with state-of-the-art methods. Further bioinformatics analysis shows that the predicted protein complexes have similar attributes to known ones. Moreover, case studies corroborated the remarkable predictive capability of our model in identifying protein complexes, including three that were not only experimentally validated by recent studies but also exhibited high-confidence structural predictions from AlphaFold-Multimer.
We believe that the HyperGraphComplex algorithm and our provided proteome-wide high-confidence protein complex prediction dataset will help elucidate how proteins regulate cellular processes in the form of complexes, and facilitate disease diagnosis and treatment and drug development. Source codes are available at https://github.com/LiDlab/HyperGraphComplex.
biochemical research methods,mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses the prediction of novel protein complexes by integrating protein sequence data with protein–protein interaction (PPI) data. The authors propose HyperGraphComplex, a method based on a hypergraph variational autoencoder (HGVAE) that extracts features from protein sequences and combines them with the topological characteristics of the PPI network to predict protein complexes.

### Background and Problem Description

Protein–protein interactions (PPIs) are the basis of many important biological processes, and protein complexes are the main form in which these interactions are carried out. Understanding protein complexes and their functions is crucial for revealing the mechanisms of life processes, for disease diagnosis and treatment, and for drug development. However, experimental methods for identifying protein complexes have many limitations, so computational methods are needed to predict them.

### Limitations of Existing Methods

1. **Methods Relying Only on PPI Networks**: These methods are easily affected by network noise and have difficulty predicting complexes that are small or sparsely connected internally.
2. **Supervised Learning Methods**: These methods usually rely on feature engineering, and how to fully describe the biological characteristics of protein complexes still needs further research.
3. **Multi-Source Information Fusion Methods**: The effectiveness of these methods is limited because the biological annotations of some proteins are incomplete.

### Proposed Method

The authors propose HyperGraphComplex, a method based on hypergraph learning. Its main features are:

- **Hypergraph Variational Autoencoder (HGVAE)**: It can represent higher-order, non-pairwise relationships and so learn more complex high-order protein interaction patterns.
- **Integrating Protein Sequence and PPI Network Topology**: The encoder and decoder are trained to simultaneously generate the latent feature vectors of protein complexes, and these are combined with a deep neural network (DNN) to identify candidate protein complexes.
- **Fully Data-Driven**: It does not require any manually designed features.

### Experimental Results

The experimental results show that HyperGraphComplex achieves satisfactory predictive performance compared with existing state-of-the-art methods. Bioinformatics analysis shows that the predicted complexes have biological properties similar to those of known complexes. In addition, case studies confirm the model's predictive ability in identifying protein complexes, including three complexes that were experimentally validated by recent studies and that also obtained high-confidence structural predictions from AlphaFold-Multimer.

### Conclusion

The HyperGraphComplex algorithm and the proteome-wide, high-confidence protein complex prediction dataset it provides should help clarify how proteins regulate cellular processes in the form of complexes, and facilitate disease diagnosis and treatment and drug development.
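
To make the idea of "higher-order, non-pairwise relationships" more concrete, the following is a minimal sketch of how protein complexes can be encoded as hyperedges and how one hypergraph convolution step propagates features over them. It is not taken from the HyperGraphComplex source code. It assumes the commonly used normalized operator $D_v^{-1/2} H W D_e^{-1} H^\top D_v^{-1/2}$ from the hypergraph neural network literature, and the protein names, complexes, feature matrix `X`, and weight matrix `theta` are all hypothetical placeholders (in the paper's setting, `X` would be derived from protein sequences).

```python
import numpy as np


def incidence_matrix(proteins, complexes):
    """Build H: rows are proteins (nodes), columns are complexes (hyperedges)."""
    index = {p: i for i, p in enumerate(proteins)}
    H = np.zeros((len(proteins), len(complexes)))
    for j, members in enumerate(complexes):
        for p in members:
            H[index[p], j] = 1.0
    return H


def hypergraph_propagation(H, edge_weights=None):
    """Return Dv^-1/2 H W De^-1 H^T Dv^-1/2 for incidence matrix H."""
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_weights is None else np.asarray(edge_weights, dtype=float)
    dv = H @ w           # weighted degree of each protein
    de = H.sum(axis=0)   # number of proteins in each hyperedge
    # Invert degrees only where they are non-zero, so isolated nodes stay at 0.
    dv_inv_sqrt = np.zeros_like(dv)
    dv_inv_sqrt[dv > 0] = dv[dv > 0] ** -0.5
    de_inv = np.zeros_like(de)
    de_inv[de > 0] = 1.0 / de[de > 0]
    left = dv_inv_sqrt[:, None] * H
    return left @ np.diag(w * de_inv) @ left.T


def hypergraph_conv(G, X, theta):
    """One convolution layer: ReLU(G X theta)."""
    return np.maximum(G @ X @ theta, 0.0)


# Hypothetical toy data: four proteins, two complexes sharing protein "C".
proteins = ["A", "B", "C", "D"]
complexes = [["A", "B", "C"], ["C", "D"]]

H = incidence_matrix(proteins, complexes)
G = hypergraph_propagation(H)

rng = np.random.default_rng(0)
X = rng.normal(size=(len(proteins), 8))   # placeholder for sequence-derived features
theta = rng.normal(size=(8, 4))           # placeholder for learnable weights
Z = hypergraph_conv(G, X, theta)          # per-protein embeddings, shape (4, 4)
```

A single hyperedge connects all members of a complex at once, so a three-protein complex is one column of `H` rather than three pairwise edges. In a variational autoencoder, a layer like this would typically feed two heads that output the mean and log-variance of each latent vector; that part is omitted here.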