SubFeat: Feature Subspacing Ensemble Classifier for Function Prediction of DNA, RNA and Protein Sequences

H.M.Fazlul Haque,Fariha Arifin,Adilina,Muhammod Rafsanjani,Swakkhar Shatabda,Sheikh Adilina
DOI: https://doi.org/10.1101/2020.08.04.228536
2020-08-04
Abstract:Abstract The information of a cell is primarily contained in Deoxyribonucleic Acid (DNA). There is a flow of information of DNA to protein sequences via Ribonucleic acids (RNA) through transcription and translation. These entities are vital for the genetic process. Recent developments in epigenetic also show the importance of the genetic material and knowledge of their attributes and functions. However, the growth in known attributes or functionalities of these entities are still in slow progression due to the time consuming and expensive in vitro experimental methods. In this paper, we have proposed an ensemble classification algorithm called SubFeat to predict the functionalities of biological entities from different types of datasets. Our model uses a feature subspace based novel ensemble method. It divides the feature space into sub-spaces which are then passed to learn individual classifier models and the ensemble is built on this base classifiers that uses a weighted majority voting mechanism. SubFeat tested on four datasets comprising two DNA, one RNA and one protein dataset and it outperformed all the existing single classifiers and as well as the ensemble classifiers. SubFeat is made availalbe as a Python-based tool. We have made the package SubFeat available online along with a user manual. It is freely accessible from here: https://github.com/fazlulhaquejony/SubFeat .
What problem does this paper attempt to address?