Abstract

MOTIVATION: Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated with fewer than 10 genes).

RESULTS: We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97%, recall 77%) comparable to those of other machine learning algorithms evaluated under similar conditions over densely annotated portions of the GO datasets. The method also generates highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross-validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11,000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation of predicted GO annotations, which may represent more realistic conditions than the more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions made in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43-58%) can be achieved for the human GO Annotation file dated 2003.

AVAILABILITY: The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset and other supplementary information are available at http://phenos.bsd.uchicago.edu/ITSS/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
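
The precision and recall figures quoted in the abstract can be read as set comparisons between predicted and held-out (gene, GO term) pairs. The following is a minimal sketch of that calculation only, not the ITSS algorithm or the authors' evaluation code; the function name `precision_recall`, the gene labels, and the GO identifiers are hypothetical placeholders chosen for illustration.

```python
def precision_recall(predicted, held_out):
    """Return (precision, recall) for two sets of (gene, GO term) pairs.

    predicted: annotations proposed by some prediction method
    held_out:  annotations hidden from the method (e.g. one cross-validation fold)
    """
    true_positives = len(predicted & held_out)
    # Precision: fraction of predictions that match a held-out annotation.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of held-out annotations that were recovered.
    recall = true_positives / len(held_out) if held_out else 0.0
    return precision, recall


# Hypothetical toy data: identifiers are placeholders, not real annotations.
held_out = {
    ("GENE_A", "GO:0000001"),
    ("GENE_B", "GO:0000002"),
    ("GENE_C", "GO:0000003"),
    ("GENE_D", "GO:0000004"),
}
predicted = {
    ("GENE_A", "GO:0000001"),
    ("GENE_B", "GO:0000002"),
    ("GENE_E", "GO:0000005"),
}

precision, recall = precision_recall(predicted, held_out)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In a 10-fold cross-validation the same comparison would be repeated once per fold, with each fold's annotations hidden in turn. In a historical rollback the held-out set would instead be the annotations added after the older annotation file was released, which is why a prediction absent from the later file is not necessarily wrong and the reported precision is described as a minimum.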

Information theory applied to the sparse gene ontology annotation network to predict novel gene function
