Abstract: The prediction of protein subcellular localization is critical for inferring protein function, gene regulation and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which makes it possible to predict yeast protein subcellular localization computationally. However, widely used protein sequence representations, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), struggle to capture adequate information about the interactions between residues and the position distribution of each residue. Novel sequence representations are therefore still needed. In this study, we present two novel protein sequence representations: the Generalized Chaos Game Representation (GCGR), which is based on the frequency and distribution of the residues in the protein primary sequence, and novel statistics and information theory (NSI) features, which reflect local position information in the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, whereas other popular methods such as PseAAC and dipeptide composition use features with hundreds of dimensions. In practice, this representation is highly efficient for predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 on the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view feature-based method that combines the two features above with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This model achieves prediction accuracies of 0.927 and 0.871 on the CL317 and ZW225 datasets, respectively, outperforming other existing methods in jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations for predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from published articles in authoritative journals and books.
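
For readers unfamiliar with the baseline features and the evaluation protocol mentioned above, the following is a minimal sketch, not the authors' implementation. It computes the standard amino acid composition (20 dimensions) and dipeptide composition (400 dimensions) of a sequence, and runs a jackknife (leave-one-out) test. The function names, the handling of non-standard residues, and the 1-nearest-neighbour classifier are all assumptions made for illustration; the abstract does not specify the simple model it uses, and GCGR and NSI are not reproduced here because the abstract does not define them.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def _clean(seq):
    # Assumption: non-standard symbols (e.g. X, B, Z) are simply dropped,
    # so residues on either side of a dropped symbol become adjacent.
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)


def amino_acid_composition(seq):
    """Return the 20-dimensional vector of residue frequencies."""
    seq = _clean(seq)
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the 400-dimensional vector of adjacent-pair frequencies."""
    seq = _clean(seq)
    total = len(seq) - 1  # number of overlapping adjacent pairs
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return [counts[d] / total for d in DIPEPTIDES]


def nearest_neighbor(train_x, train_y, x):
    """Hypothetical stand-in classifier: 1-NN with squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[dists.index(min(dists))]


def jackknife_accuracy(features, labels, predict):
    """Leave-one-out test: each sample is held out once and predicted
    from all remaining samples; returns the fraction predicted correctly."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Calling `jackknife_accuracy(features, labels, nearest_neighbor)` on a list of feature vectors and their localization labels gives the kind of accuracy figure quoted above, although the reported values come from the paper's own features and classifiers rather than from this sketch.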

Improving the generalization of protein expression models with mechanistic sequence information

Accuracy and data efficiency in deep learning models of protein expression

Evolutionary context-integrated deep sequence modeling for protein engineering

Adaptive machine learning for protein engineering

Generative models for protein sequence modeling: recent advances and future directions

Improving accuracy of protein-protein interaction prediction by considering the converse problem for sequence representation

Deep Learning Prediction of Ribosome Profiling with Translatomer Reveals Translational Regulation and Interprets Disease Variants

Automated characterization and analysis of expression compatibility between regulatory sequences and metabolic genes in Escherichia coli

Predicting and Interpreting Protein Developability Via Transfer of Convolutional Sequence Representation

Machine learning to predict continuous protein properties from binary cell sorting data and map unseen sequence space

Enhanced Performance of Gene Expression Predictive Models with Protein-Mediated Spatial Chromatin Interactions

Codon language embeddings provide strong signals for use in protein engineering

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

MPEPE, a predictive approach to improve protein expression in E. coli based on deep learning

A deep auto-encoder model for gene expression prediction

Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features

Inferring gene regulatory networks from single-cell data: a mechanistic approach

Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX

Transfer learning for cross-context prediction of protein expression from 5'UTR sequence

Integration of protein and coding sequences enables mutual augmentation of the language model

Semantical and Geometrical Protein Encoding Toward Enhanced Bioactivity and Thermostability