Graph-based prediction of Protein-protein interactions with attributed signed graph embedding

Fang Yang, Kunjie Fan, Dandan Song, Huakang Lin
DOI: https://doi.org/10.1186/s12859-020-03646-8
IF: 3.307
2020-07-21
BMC Bioinformatics
Abstract:
Background: Protein-protein interactions (PPIs) are central to many biological processes. Because experimental methods for identifying PPIs are time-consuming and expensive, it is important to develop automated computational methods to better predict PPIs. Various machine learning methods have been proposed, including a sequence-based deep learning technique that has achieved promising results. However, it focuses only on sequence information and ignores the structural information of PPI networks. Structural information of PPI networks, such as node degree, position, and neighboring nodes in the graph, has been shown to be informative in PPI prediction.
Results: Facing the challenge of representing graph information, we introduce an improved graph representation learning method. Our model performs PPI prediction based on both sequence information and graph structure. Moreover, our study takes advantage of a representation learning model and employs a graph-based deep learning method for PPI prediction, which shows superiority over existing sequence-based methods. Our method achieves state-of-the-art accuracy of 99.15% on the Human Protein Reference Database (HPRD) dataset and also obtains the best results on the Database of Interacting Proteins (DIP) Human, Drosophila, Escherichia coli (E. coli), and Caenorhabditis elegans (C. elegans) datasets.
Conclusion: Here, we introduce the signed variational graph auto-encoder (S-VGAE), an improved graph representation learning method, to automatically learn to encode graph structure into low-dimensional embeddings. Experimental results demonstrate that our method outperforms other existing sequence-based methods on several datasets. We also demonstrate the robustness of our model for very sparse networks and its generalization to a new dataset that consists of four datasets: HPRD, E. coli, C. elegans, and Drosophila.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
The core problem that this paper attempts to solve is **how to use graph-structure information and sequence information to predict protein-protein interactions (PPIs) more accurately**. Specifically, the researchers focus on the following points:

1. **Background problems**: Protein-protein interactions (PPIs) play a crucial role in many biological processes, such as signal transduction, immune response, and cell proliferation. Experimental methods (such as yeast two-hybrid and affinity purification) can detect PPIs, but they are time-consuming and costly and have a high false-positive rate. It is therefore important to develop efficient computational methods to predict PPIs.
2. **Limitations of existing methods**:
   - Traditional machine-learning methods mainly rely on sequence information and ignore the structural information in the PPI network (such as node degree, position, and neighboring nodes).
   - Although deep-learning methods perform well in feature extraction, most of them focus only on sequence data and fail to fully utilize the graph-structure information of the PPI network.
3. **Research objectives**: Propose a deep-learning method that combines graph-structure information and sequence information to predict PPIs more accurately. To this end, the authors introduce an improved graph representation-learning model, the **Signed Variational Graph Auto-Encoder (S-VGAE)**, which treats the PPI network as an undirected weighted graph and combines it with sequence features for modeling.

---

### Formula summary

The main formulas involved in the paper relate to the evaluation metrics and the model architecture:

#### 1. Evaluation metrics

The paper uses the following formulas to measure model performance:

- **Accuracy**:
  $$ \text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} $$
- **Sensitivity**:
  $$ \text{Sensitivity}=\frac{\text{TP}}{\text{TP}+\text{FN}} $$
- **Specificity**:
  $$ \text{Specificity}=\frac{\text{TN}}{\text{TN}+\text{FP}} $$
- **Precision**:
  $$ \text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}} $$
- **F-score**:
  $$ F\text{-score}=2\cdot\frac{\text{Precision}\cdot\text{Sensitivity}}{\text{Precision}+\text{Sensitivity}} $$

Here, $\text{TP}$, $\text{TN}$, $\text{FP}$, and $\text{FN}$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

#### 2. Core ideas of the S-VGAE model

The S-VGAE model is based on the Variational Graph Auto-Encoder (VGAE) and improves its cost function so that it focuses on high-confidence interaction information. Specific improvements include:

- **Modify the cost function**: Only consider high-confidence interaction information.
- **Assign different weights**: Assign different signs to different interactions in the adjacency matrix to enhance the influence of negative interactions.
- **Classifier design**: Use a simple three-layer softmax classifier instead of the generative model for final prediction.

---

### Core contributions of the solution

1. **Combination of graph-structure and sequence information**: The S-VGAE model not only considers the graph-structure information of the PPI network (such as node degree, position, and neighbor relationships), but also combines it with protein sequence features, thereby improving prediction performance.
2. *