Abstract:

Background: Prediction of protein subcellular localization involves many complex factors, and a model that uses only one or two kinds of data may miss much of the signal. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources and so exploit protein feature information from several aspects. Gene ontology (hereinafter GO) uses a controlled vocabulary to describe biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models for protein subcellular localization generally either concatenate the GO terms into a flat binary vector or apply majority-vote ensemble learning. Neither approach can estimate the individual discriminative abilities of the three aspects of gene ontology.

Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, so that time and space complexity is greatly reduced compared with the more complicated semidefinite programming and semi-infinite linear programming. The five kernels are then linearly merged into a single kernel for protein subcellular localization. We evaluate GO-TLM against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets that the baseline models adopted. Five-fold cross validation experiments show that GO-TLM achieves substantially higher accuracy than the baseline models: 80.38% against 67.40% for Euk-mPLoc, an increase of 12.98 percentage points; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, increases of 7.05 and 6.67 points on the MultiLoc plant and MultiLoc animal datasets, respectively; and 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, increases of 13.44, 5.80 and 11.15 points on the BaCelLoc plant, BaCelLoc fungi and BaCelLoc animal datasets, respectively. On the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant holdout, BaCelLoc fungi holdout and BaCelLoc animal holdout datasets, respectively, against 76.00%, 60.00% and 73.00% for the baseline MultiLoc-GO, increases of 5.25, 20.45 and 6.46 points.

Conclusions: Direct homology-based GO term transfer is prone to introducing noise and outliers into the target protein. We therefore design an explicitly weighted kernel learning system, the Gene Ontology Based Transfer Learning Model (GO-TLM), to transfer known knowledge about related homologous proteins to the target protein. This reduces the risk of outliers, shares knowledge between homologous proteins, and thus achieves better predictive performance for protein subcellular localization. Cross validation and independent test results show that homology-based GO term transfer and explicit weighing of the GO kernels substantially improve prediction performance.
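
The kernel construction summarized in the abstract (GO kernels, spectrum kernels, explicit weights, linear merging) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the kernel definitions or the weighting rule, so the sketch makes several assumptions: the GO kernel is taken to be the number of shared GO terms, the spectrum kernel is the standard k-mer count dot product, and each kernel's weight is taken to be proportional to its own cross-validation accuracy. All function names, the toy proteins and the accuracy values are hypothetical.

```python
from collections import Counter


def spectrum_kernel(seq_a, seq_b, k=3):
    """k-spectrum kernel: dot product of the k-mer count vectors of two sequences."""
    count_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    count_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(n * count_b[kmer] for kmer, n in count_a.items())


def go_kernel(terms_a, terms_b):
    """Assumed GO kernel: number of GO terms shared by two proteins,
    i.e. the inner product of their binary GO-term vectors."""
    return len(set(terms_a) & set(terms_b))


def gram_matrix(items, kernel):
    """Pairwise kernel values for a list of items."""
    return [[kernel(a, b) for b in items] for a in items]


def weights_from_cv(accuracies):
    """Assumption: weight each kernel in proportion to its cross-validation accuracy."""
    total = sum(accuracies)
    return [acc / total for acc in accuracies]


def combine_kernels(grams, weights):
    """Linearly merge several Gram matrices into one: K = sum_m w_m * K_m."""
    n = len(grams[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, grams)) for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): two proteins with sequences and cellular-component terms.
sequences = ["MKTAYIAKQR", "MKTAYLAKQR"]
cc_terms = [{"GO:0005634", "GO:0005737"}, {"GO:0005737", "GO:0005739"}]

grams = [
    gram_matrix(cc_terms, go_kernel),
    gram_matrix(sequences, lambda a, b: spectrum_kernel(a, b, k=3)),
]
# Hypothetical per-kernel cross-validation accuracies, not taken from the paper.
weights = weights_from_cv([0.90, 0.60])
merged = combine_kernels(grams, weights)
print(merged)
```

The merged matrix could then be passed to any classifier that accepts a precomputed kernel. In the full model there would be five Gram matrices (three GO kernels and two spectrum kernels) rather than the two shown here.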

Fine-tuning Protein Embeddings for Functional Similarity Evaluation

Embeddings from deep learning transfer GO annotations beyond homology

OntoProtein: Protein Pretraining With Gene Ontology Embedding

Partial order relation–based gene ontology embedding improves protein function prediction

Evaluation of GO-based Functional Similarity Measures Using S. Cerevisiae Protein Interaction and Expression Profile Data

Comparative Analysis of Unsupervised Protein Similarity Prediction Based on Graph Embedding

Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations

Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks

Neural Embeddings for Protein Graphs

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Fixed-Length Protein Embeddings using Contextual Lenses

Fine-tuning protein language models boosts predictions across diverse tasks

Semantical and Geometrical Protein Encoding Toward Enhanced Bioactivity and Thermostability

Protein Function Prediction With Functional and Topological Knowledge of Gene Ontology

PROTGOAT : Improved automated protein function predictions using Protein Language Models

An integrative approach to protein sequence design through multiobjective optimization

Modeling the language of life – Deep Learning Protein Sequences

Gene ontology based transfer learning for protein subcellular localization

Protein function prediction as approximate semantic entailment

TransformerGO: predicting protein-protein interactions by modelling the attention between sets of gene ontology terms

Codon language embeddings provide strong signals for use in protein engineering