Abstract:

Background: Prediction of protein subcellular localization involves many complex factors, and using only one or two kinds of data may not capture the full picture. For this reason, some recent predictive models deliberately integrate multiple heterogeneous data sources to exploit multi-aspect protein feature information. Gene ontology (GO) uses a controlled vocabulary to describe biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenate the GO terms into a flat binary vector or apply majority-vote ensemble learning for protein subcellular localization; neither approach can estimate the individual discriminative abilities of the three aspects of gene ontology.

Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers signature-based homologous GO terms to the target proteins and constructs a reliable learning system to reduce the adverse effect of potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and two spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, so that the time and space complexities are greatly reduced compared with the more complicated semi-definite programming and semi-infinite linear programming. The five kernels are then linearly merged into a single kernel for protein subcellular localization. We evaluate GO-TLM against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets those models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantially higher accuracy than the baseline models: 80.38% versus 67.40% for Euk-mPLoc, an increase of 12.98 percentage points; 96.65% and 96.27% versus 89.60% and 89.60% for MultiLoc-GO on the MultiLoc plant and MultiLoc animal datasets, increases of 7.05 and 6.67 percentage points, respectively; and 97.14%, 95.90% and 96.85% versus 83.70%, 90.10% and 85.70% for MultiLoc-GO on the BaCelLoc plant, BaCelLoc fungi and BaCelLoc animal datasets, increases of 13.44, 5.80 and 11.15 percentage points, respectively. On the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant holdout, BaCelLoc fungi holdout and BaCelLoc animal holdout datasets, respectively, compared with 76.00%, 60.00% and 73.00% for MultiLoc-GO, increases of 5.25, 20.45 and 6.46 percentage points, respectively.

Conclusions: Because direct homology-based GO term transfer is prone to introducing noise and outliers into the target protein, we design an explicitly weighted kernel learning system, the Gene Ontology Based Transfer Learning Model (GO-TLM), to transfer known knowledge about related homologous proteins to the target protein. This reduces the risk of outliers, shares knowledge between homologous proteins, and thus achieves better predictive performance for protein subcellular localization. Cross validation and independent test results show that homology-based GO term transfer and explicit weighting of the GO kernels substantially improve prediction performance.
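The kernel construction described in the Results can be illustrated with a minimal sketch. It shows a k-spectrum kernel over protein sequences and a linear merge of several kernel matrices with explicit weights. The abstract does not give the weighting rule, so the sketch assumes each weight is the kernel's cross-validation accuracy rescaled to sum to 1; the function names, the toy sequences and the accuracy values are hypothetical, and the cosine normalization is an added assumption so that kernels are on a comparable scale. The three GO kernels are not reproduced here; they would enter `combine_kernels` as additional Gram matrices.

```python
from collections import Counter

import numpy as np


def spectrum_features(seq, k):
    """Count every length-k substring (k-mer) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seqs, k):
    """Gram matrix of the k-spectrum kernel: dot products of k-mer count vectors."""
    feats = [spectrum_features(s, k) for s in seqs]
    n = len(seqs)
    gram = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            # A Counter returns 0 for k-mers it has not seen.
            value = sum(c * feats[j][kmer] for kmer, c in feats[i].items())
            gram[i, j] = gram[j, i] = value
    return gram


def normalize_kernel(gram):
    """Cosine-normalize a Gram matrix so that its diagonal is 1 (assumed preprocessing)."""
    d = np.sqrt(np.diag(gram))
    d[d == 0] = 1.0  # avoid division by zero for empty feature vectors
    return gram / np.outer(d, d)


def combine_kernels(grams, cv_accuracies):
    """Linearly merge Gram matrices.

    Assumed rule: each weight is the kernel's cross-validation accuracy,
    rescaled so that the weights sum to 1.
    """
    w = np.asarray(cv_accuracies, dtype=float)
    w = w / w.sum()
    merged = sum(wi * g for wi, g in zip(w, grams))
    return merged, w


# Toy usage with hypothetical sequences and hypothetical per-kernel accuracies.
seqs = ["MKVLA", "MKVLG", "GGSSA"]
k1 = normalize_kernel(spectrum_kernel(seqs, 1))
k2 = normalize_kernel(spectrum_kernel(seqs, 2))
merged, weights = combine_kernels([k1, k2], cv_accuracies=[0.6, 0.9])
print(weights)
print(merged)
```

The merged matrix can be passed to any kernel classifier that accepts a precomputed Gram matrix. Because the weights come from independent per-kernel cross-validation runs rather than from a joint optimization, no semi-definite or semi-infinite program has to be solved, which is the cost saving the abstract refers to.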

Species-agnostic transfer learning for cross-species transcriptomics data integration without gene orthology

Transfer Learning Efficiently Maps Bone Marrow Cell Types from Mouse to Human Using Single-Cell RNA Sequencing

Within- and Cross-Species Predictions of Plant Specialized Metabolism Genes Using Transfer Learning.

Computational strategies for cross-species knowledge transfer and translational biomedicine

Cell type matching across species using protein embeddings and transfer learning

Transfer Learning Of Gene Expression Using Reactome

Biologically Informed Deep Learning to Infer Gene Program Activity in Single Cells

Generative modeling and latent space arithmetics predict single-cell perturbation response across cell types, studies and species

Multigrate: single-cell multi-omic data integration

Integration and transfer learning of single-cell transcriptomes via cFIT

Cross-Species Protein Function Prediction with Asynchronous-Random Walk

OrthologAL: A Shiny application for quality-aware humanization of non-human pre-clinical high-dimensional gene expression data

Sctab: Scaling Cross-Tissue Single-Cell Annotation Models

Transfer learning for cross-context prediction of protein expression from 5'UTR sequence

Gene ontology based transfer learning for protein subcellular localization

Joint representation of molecular networks from multiple species improves gene classification

Multi-modal Transfer Learning between Biological Foundation Models

AutoTransOP: translating omics signatures without orthologue requirements using deep learning

Adversarial learning enables unbiased organism-wide cross-species alignment of single-cell RNA data at scale

Benchmarking strategies for cross-species integration of single-cell RNA sequencing data