Multi-function Prediction of Unknown Protein Sequences Using Multilabel Classifiers and Augmented Sequence Features

Saurabh Agrawal, Dilip Singh Sisodia, Naresh Kumar Nagwani
DOI: https://doi.org/10.1007/s40995-021-01134-z
2021-05-04
Abstract: Numerous protein sequences exist at multiple subcellular locations simultaneously and exhibit multiple functions. Multi-function characterization of Unknown Protein Sequences (UPS) is useful for analyzing multi-symptom diseases and multi-target drugs. In this work, a multisite subcellular localization model is proposed for the multi-function characterization of UPS using augmented features and algorithm-adaptation multilabel classifiers. Protein sequence features are augmented with physicochemical and evolutionary properties of amino acid residues, yielding feature vectors that preserve both sequence-order information and residue properties. Less discriminative and redundant features are discarded from the feature vector using Multilabel Linear Discriminant Analysis (MLDA). Two multisite datasets, Gram-Positive (ML_G+) and Gram-Negative (ML_G−), are used for the experiments; in each, a protein sequence that appears several times with different single labels is merged into one unique multilabel protein sequence. The preprocessed feature vectors of ML_G+ and ML_G− are used separately to train multilabel classifiers, namely Decision Tree (ML_C4.5), K-Nearest Neighbor (ML_kNN), Multi-Layer Perceptron (MLP), Extra Tree (ExTr), and Random Forest (RF), using fivefold cross-validation. The validated multisite model is then used to predict single as well as multiple functions of the UPS. On known protein sequences, the model achieved an accuracy of 94.23% for ML_G+ and 82.77% for ML_G− using MLP; on UPS, the accuracy is 77.50% for ML_G+ using MLP and ExTr, and 54.28% for ML_G− using ML_kNN.
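The dataset-preparation step described in the abstract, where a sequence listed several times with different single labels becomes one multilabel record, might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' code: the function names, the toy sequences, and the location names are all hypothetical, and the real datasets and feature extraction are not reproduced here.

```python
from collections import defaultdict


def merge_single_label_records(records):
    """Group (sequence, location) pairs so each unique sequence maps to a set of locations."""
    merged = defaultdict(set)
    for sequence, location in records:
        merged[sequence].add(location)
    return dict(merged)


def to_indicator_matrix(merged):
    """Build one 0/1 row per unique sequence, with one column per location label."""
    label_names = sorted({loc for locs in merged.values() for loc in locs})
    sequences = sorted(merged)
    rows = [
        [1 if name in merged[seq] else 0 for name in label_names]
        for seq in sequences
    ]
    return sequences, label_names, rows


# Hypothetical toy records (made up for illustration, not taken from ML_G+ or ML_G−).
records = [
    ("MKTAYIAK", "cell membrane"),
    ("MKTAYIAK", "extracellular"),
    ("MSLLTEVE", "cytoplasm"),
]

merged = merge_single_label_records(records)
sequences, label_names, rows = to_indicator_matrix(merged)

for seq, row in zip(sequences, rows):
    print(seq, dict(zip(label_names, row)))
```

The resulting binary indicator rows are the usual target format for multilabel classifiers; feature augmentation, MLDA, and the fivefold cross-validation mentioned in the abstract would operate on top of such a table.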
multidisciplinary sciences