Abstract: Predicting protein subcellular localization is critical for inferring protein function, gene regulation and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which makes it possible to predict yeast protein subcellular localization computationally. However, widely used protein sequence representations, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), struggle to capture adequate information about the interactions between residues and the positional distribution of each residue. Novel sequence representations are therefore still needed. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of the residues in the protein primary sequence, and novel statistics and information theory (NSI), which reflects local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, whereas other popular methods such as PseAAC and dipeptide composition use feature vectors with hundreds of dimensions. In practice, this feature representation is highly efficient at predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 on the CL317 and ZW225 datasets, respectively. To further evaluate the proposed encoding schemes, we introduce a multi-view feature-based method that combines the two features above with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This model achieves prediction accuracies of 0.927 and 0.871 on the CL317 and ZW225 datasets, respectively, outperforming other existing methods in jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations for predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations using evidence from published articles in authoritative journals and books.
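
To make the dimensionality contrast concrete, the following minimal sketch computes two of the conventional representations the abstract mentions: amino acid composition (20 dimensions) and dipeptide composition (400 dimensions). It uses the standard textbook definitions of these features, not the authors' code. The function names and the example sequence are hypothetical, and GCGR and NSI are not shown because the abstract does not define them.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in the same fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue frequencies.

    Non-standard characters (e.g. 'X') are ignored.
    """
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent residue-pair frequencies.

    Pairs containing a non-standard character are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[p] / total for p in DIPEPTIDES]


if __name__ == "__main__":
    example = "MKVLAAGIVK"  # hypothetical toy sequence, not from CL317 or ZW225
    print(len(amino_acid_composition(example)))  # 20
    print(len(dipeptide_composition(example)))   # 400
```

Amino acid composition discards residue order entirely, and dipeptide composition captures only adjacent pairs at the cost of a 400-dimensional vector. The 5-dimensional GCGR + NSI representation is positioned against both.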
