NetGO: Improving Large-scale Protein Function Prediction with Massive Network Information
Ronghui You, Shuwei Yao, Xiaodi Huang, Fengzhu Sun, Hiroshi Mamitsuka, Shanfeng Zhu
DOI: https://doi.org/10.1101/439554
IF: 14.9
2018-01-01
Nucleic Acids Research
Abstract: Automated function prediction (AFP) of proteins is of great significance in biology. In essence, AFP is a large-scale multi-label classification over pairs of proteins and GO terms. Existing AFP approaches, however, have limitations on both the protein side and the GO term side. Using various kinds of sequence information and the robust learning to rank (LTR) framework, we have developed GOLabeler, a state-of-the-art approach in CAFA3, which overcomes limitations on the GO term side, such as imbalanced GO terms. On the protein side, however, the abundant protein information available beyond sequences has not been used effectively for large-scale AFP in CAFA. We propose NetGO, which improves large-scale AFP with massive network information. The novelty of NetGO in using network information is threefold: 1) the LTR framework of NetGO efficiently and effectively integrates both sequence and network information, which makes large-scale AFP straightforward; 2) NetGO can use the whole, massive network information of all species (>2000) in STRING (rather than only high-confidence links and/or some specific species); and 3) NetGO can still use network information to annotate a protein by homology transfer even if the protein is not covered in STRING. We examined the performance of NetGO under numerous experimental settings, including general performance comparison, species-specific prediction, and prediction on difficult proteins, using training and test data separated by the time-delayed settings of CAFA. The experimental results demonstrate that NetGO significantly outperforms GOLabeler, DeepGO, and the other baseline methods compared. In addition, several findings from our experiments on NetGO should be useful for future AFP research.
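To make the pairwise formulation and the homology-transfer idea in the abstract concrete, the following is a minimal illustrative sketch, not NetGO's actual implementation. It scores (protein, GO term) pairs from weighted network neighbours, falls back to a homolog's neighbourhood when the protein is absent from the network, and merges the result with sequence-based scores. All names (`network_scores`, `rank_terms`, `homologs`, `weight`) are hypothetical, and the fixed weighted sum is only a stand-in for the learning-to-rank model, which the abstract does not specify in detail.

```python
from collections import defaultdict


def network_scores(protein, network, annotations, homologs=None):
    """Score GO terms for `protein` from its weighted network neighbours.

    network:     dict protein -> dict neighbour -> edge weight (assumed layout)
    annotations: dict protein -> set of GO terms
    homologs:    dict protein -> a homolog that is present in the network;
                 used only when `protein` itself is not covered (assumption)
    """
    source = protein
    if source not in network and homologs is not None:
        # Homology transfer: borrow the neighbourhood of a covered homolog.
        source = homologs.get(protein)
    neighbours = network.get(source, {})
    total = sum(neighbours.values())
    if total == 0:
        return {}
    scores = defaultdict(float)
    for neighbour, weight in neighbours.items():
        for term in annotations.get(neighbour, ()):
            # Each neighbour votes for its GO terms in proportion to its edge weight.
            scores[term] += weight / total
    return dict(scores)


def rank_terms(seq_scores, net_scores, weight=0.5):
    """Rank GO terms by a weighted sum of two component scores.

    A simple stand-in for a learned ranking model; `weight` is hypothetical.
    """
    terms = set(seq_scores) | set(net_scores)
    combined = {
        t: (1 - weight) * seq_scores.get(t, 0.0) + weight * net_scores.get(t, 0.0)
        for t in terms
    }
    # Highest score first; ties broken by term ID for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (invented for illustration only).
network = {"P1": {"P2": 3, "P3": 1}}
annotations = {"P2": {"GO:A"}, "P3": {"GO:A", "GO:B"}}
homologs = {"Q9": "P1"}  # Q9 is not in the network; P1 is its assumed homolog

net = network_scores("Q9", network, annotations, homologs)
ranking = rank_terms({"GO:B": 0.9}, net, weight=0.5)
```

In this toy case, `Q9` is not covered by the network, so its GO term scores come from the neighbours of its homolog `P1`; the ranking then reflects both the sequence-based and the network-based evidence.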