Abstract: Background: Proteins play a crucial role in life activities, such as catalyzing metabolic reactions, replicating DNA, and responding to stimuli. Identifying protein structures and functions is critical for both basic research and applications. Because traditional experiments for studying the structures and functions of proteins are expensive and time consuming, computational approaches are highly desired. The key to computational methods is how to efficiently extract features from protein sequences. During the last decade, many powerful feature extraction algorithms have been proposed, significantly advancing the study of protein structures and functions. Objective: To help researchers catch up with recent developments in this important field, this study gives an updated review focusing on sequence-based feature extraction for proteins. Method: The sequence-based features of proteins were grouped into three categories: composition-based features, autocorrelation-based features, and profile-based features. The features in each group were introduced in detail, and their advantages and disadvantages were discussed. Some useful tools for generating these features were also introduced. Results: Generally, autocorrelation-based features outperform composition-based features, and profile-based features outperform autocorrelation-based features. The reason is that profile-based features incorporate evolutionary information, which is useful for identifying protein structures and functions. However, profile-based features are more time consuming to compute, because a multiple sequence alignment is required. Conclusion: In this study, some recently proposed sequence-based features were introduced and discussed, such as basic k-mers, PseAAC, auto-cross covariance, and top-n-gram.
These features have made great contributions to the development of protein sequence analysis. Future studies can focus on exploring combinations of these features. In addition, techniques from other fields, such as signal processing, natural language processing (NLP), and image processing, could also contribute to this important field. In particular, natural languages (such as English) and protein sequences share some similarities, so proteins can be treated as documents, and features such as k-mers, top-n-grams, and motifs can be treated as the words of the language. Techniques from these fields will provide new ideas and strategies for extracting features from proteins.
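
To make two of the three feature categories concrete, the following is a minimal Python sketch of a composition-based feature (normalized k-mer frequencies, with k-mers playing the role of "words") and an autocorrelation-based feature (auto covariance over a single physicochemical property). It is an illustration under stated assumptions, not the implementation of any tool covered by the review: the function names `kmer_composition` and `auto_covariance` are hypothetical, the Kyte-Doolittle hydropathy scale is chosen here only as an example property, and the normalization choices are one common convention among several.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Kyte-Doolittle hydropathy values; an example property chosen for this
# sketch, not one prescribed by the abstract.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmer_composition(sequence, k=2):
    """Composition-based feature: normalized k-mer frequencies.

    Returns a vector of length 20**k, ordered by the k-mer vocabulary.
    k-mers containing non-standard residues are ignored.
    """
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[word] for word in vocabulary)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


def auto_covariance(sequence, max_lag=3, scale=KYTE_DOOLITTLE):
    """Autocorrelation-based feature: auto covariance of one property.

    For each lag from 1 to max_lag, averages the product of the
    mean-centered property values of residues that are `lag` apart.
    """
    values = [scale[residue] for residue in sequence]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


if __name__ == "__main__":
    seq = "MKVLAAGIVGLLLAQ"  # made-up sequence for illustration only
    print(len(kmer_composition(seq, k=2)))  # dimension of the 2-mer vector
    print(auto_covariance(seq, max_lag=3))
```

The k-mer vector discards residue order beyond k positions, whereas the auto covariance values capture how a property co-varies between residues a fixed distance apart. Profile-based features are not sketched here because they first require a multiple sequence alignment or a profile derived from one.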

A Review on the Recent Developments of Sequence-based Protein Feature Extraction Methods