Abstract: Predicting protein subcellular localization is critical for inferring protein function, gene regulation, and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, making it possible to predict yeast protein subcellular localization computationally. However, widely used protein sequence representations, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), struggle to capture adequate information about the interactions between residues and the positional distribution of each residue. Novel sequence representations are therefore still needed. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of residues in the protein primary sequence, and novel statistics and information theory (NSI), which reflects local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, whereas other popular methods such as PseAAC and dipeptide composition use features with hundreds of dimensions. In practice, this feature representation is highly efficient for predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 on the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view feature-based method that combines the two features above with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This model achieves prediction accuracies of 0.927 and 0.871 on the CL317 and ZW225 datasets, respectively, outperforming other existing methods in jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations for predicting yeast protein subcellular localization. Finally, we validate several newly predicted protein subcellular localizations against evidence from published articles in authoritative journals and books.
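To make the idea of multi-view feature fusion and jackknife testing concrete, the following is a minimal sketch under stated assumptions, not the method of this study. It concatenates two of the well-known views named above (amino acid composition and dipeptide composition) and scores them with a leave-one-out (jackknife) test. The GCGR and NSI features are not reproduced because the abstract does not define them, and a 1-nearest-neighbour rule stands in for the support vector machine so that the example needs only the standard library. The function names (`aac`, `dipeptide`, `fuse`, `jackknife_accuracy`) and the toy sequences and labels are hypothetical.

```python
from itertools import product
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino acid composition: frequency of each of the 20 residues."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide(seq):
    """Dipeptide composition: frequency of each of the 400 adjacent pairs."""
    total = len(seq) - 1
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


def fuse(seq):
    """Fuse the two views by simple concatenation (20 + 400 = 420 values)."""
    return aac(seq) + dipeptide(seq)


def jackknife_accuracy(features, labels):
    """Leave-one-out test: each sample is predicted from all the others.

    Assumption: a 1-nearest-neighbour rule (Euclidean distance) replaces
    the SVM classifier mentioned in the abstract.
    """
    correct = 0
    for i, x in enumerate(features):
        others = [k for k in range(len(features)) if k != i]
        nearest = min(others, key=lambda k: dist(x, features[k]))
        correct += labels[nearest] == labels[i]
    return correct / len(features)


if __name__ == "__main__":
    # Made-up toy sequences and labels, for illustration only.
    sequences = ["MKKLLKKAKK", "MKRKLKKAKR", "AVLLIVGAFL", "LLAVIGVLFA"]
    labels = ["nucleus", "nucleus", "membrane", "membrane"]
    features = [fuse(s) for s in sequences]
    print(jackknife_accuracy(features, labels))
```

Concatenation is the simplest way to fuse views. In practice, views with very different dimensionality (for example, a 5-dimensional vector next to a 400-dimensional one) usually need scaling or weighting so that the larger view does not dominate the distance.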
