Abstract: Protein Secondary Structural Class (PSSC) information is important for downstream problems on protein sequences such as protein fold recognition, protein tertiary structure prediction, and the analysis of protein function for drug discovery. Identifying PSSC with biological methods is time-consuming and costly. Several computational models have been developed to predict the structural class; however, they generalize poorly. Hence, predicting PSSC from protein sequences remains a difficult task. In this article, we propose an effective, novel, and generalized prediction model consisting of a feature modeling stage and an ensemble of classifiers. The proposed feature modeling extracts discriminating information (features) using three techniques: (i) Embedding: features are extracted on the basis of the spatial residue arrangements of the sequences using word embedding approaches; (ii) SkipXGram Bi-gram: various sets of skipped bi-gram features are extracted from the sequences; and (iii) General Statistical (GS): features are extracted that capture the global information of structural sequences. The combined feature sets are used to train an ensemble of three classifiers: Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machines (GBM). When assessed on five benchmark datasets of high and low sequence similarity, viz. z277, z498, 25PDB, 1189, and FC699, the proposed model reported overall accuracies of 93.55, 97.58, 81.82, 81.11, and 93.93 percent, respectively. The model was further validated on a large-scale, updated low-similarity (≤ 25%) dataset, where it achieved an overall accuracy of 81.11 percent. The proposed generalized model is robust and consistently outperformed several state-of-the-art models on all five benchmark datasets.
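
To make the idea of skipped bi-gram features concrete, the following is a minimal sketch and not the authors' implementation. It assumes the input is a secondary-structure string over the three-letter alphabet `HEC` (helix, strand, coil) and that each feature is the normalized frequency of a symbol pair separated by a fixed number of skipped positions. The function names, the alphabet, and the normalization are illustrative assumptions; the abstract does not specify them.

```python
from collections import Counter
from itertools import product


def skip_bigram_features(sequence, alphabet, skip):
    """Normalized frequencies of symbol pairs with `skip` positions between them.

    skip=0 gives ordinary (adjacent) bi-grams. The alphabet and the
    normalization by total pair count are assumptions made for illustration.
    """
    gap = skip + 1
    # Count every pair (sequence[i], sequence[i + gap]) that fits in the sequence.
    counts = Counter(
        (sequence[i], sequence[i + gap]) for i in range(len(sequence) - gap)
    )
    total = sum(counts.values())
    # One feature per ordered pair of alphabet symbols, in a fixed order.
    return [
        counts[pair] / total if total else 0.0
        for pair in product(alphabet, repeat=2)
    ]


def skip_bigram_vector(sequence, alphabet="HEC", max_skip=2):
    """Concatenate skip bi-gram features for skip = 0, 1, ..., max_skip."""
    features = []
    for skip in range(max_skip + 1):
        features.extend(skip_bigram_features(sequence, alphabet, skip))
    return features
```

For example, `skip_bigram_vector("HHEEC", "HEC", max_skip=1)` returns 18 values: nine adjacent-pair frequencies followed by nine skip-1 frequencies. A vector of this kind could then be concatenated with embedding and statistical features before being passed to the classifiers.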

Using an Ensemble of Support Vector Machine Classifiers to Predict Protein Supersecondary Structural Motifs.

Prediction of Functional Class of Proteins and Peptides Irrespective of Sequence Homology by Support Vector Machines.

Supersecondary Structure Prediction Using Chou's Pseudo Amino Acid Composition

An Ensemble Classifier of Support Vector Machines Used to Predict Protein Structural Classes by Fusing Auto Covariance and Pseudo-Amino Acid Composition

Using Pseudo-Amino Acid Composition and Support Vector Machine to Predict Protein Structural Class.

Predicting Protein Secondary Structure by a Support Vector Machine Based on a New Coding Scheme.

Prediction of Protein Secondary Structure Content Using Support Vector Machine

A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach.

Accurate Prediction of Protein Structural Classes by Incorporating Predicted Secondary Structure Information into the General Form of Chou's Pseudo Amino Acid Composition.

Prediction of Protein Supersecondary Structures Based on the Artificial Neural Network Method

A Protein Secondary Structure Prediction Framework Based on the Support Vector Machine

Improved Method for Predicting Protein Fold Patterns with Ensemble Classifiers.

Predicting Protein Quaternary Structure with Multi-Scale Energy of Amino Acid Factor Solution Scores and Their Combination

Prediction of Protein Secondary Structure Using Feature Selection and Analysis Approach

Recent Advances in Computational Prediction of Secondary and Supersecondary Structures from Protein Sequences

Prediction of Protein Secondary Structure Content by Using the Concept of Chou's Pseudo Amino Acid Composition and Support Vector Machine

Prediction of protein domains from sequence information using support vector machines

Enhanced Protein Structural Class Prediction Using Effective Feature Modeling and Ensemble of Classifiers

A Data Mining Approach to Predict Protein Secondary Structure

A Seqlet-Based Maximum Entropy Markov Approach for Protein Secondary Structure Prediction

Protein Structural Class Prediction Based on Distance-related Statistical Features from Graphical Representation of Predicted Secondary Structure