DEEPStack-RBP: Accurate identification of RNA-binding proteins based on autoencoder feature selection and deep stacking ensemble classifier
Qinqin Wei, Qingmei Zhang, Hongli Gao, Tao Song, Adil Salhi, Bin Yu
DOI: https://doi.org/10.1016/j.knosys.2022.109875
2022-11-28
Abstract: RNA-binding proteins (RBPs) are involved in a number of biological processes, such as RNA synthesis, protein folding, and alternative splicing. Predicting RBPs can facilitate the discovery and treatment of human diseases, such as muscle atrophy, nervous system diseases, and cancer. However, identifying RBPs with experimental methods still faces various challenges. Computational methods, and deep learning in particular, are being deployed to alleviate some of these challenges and to provide new avenues of investigation in RBP prediction. Here, we propose DEEPStack-RBP, a novel RBP prediction tool based on deep learning and ensemble learning. First, conjoint triad (CT), local descriptors (LD), pseudo amino acid composition (PseAAC), multivariate mutual information (MMI), and position-specific scoring matrix-transition probability composition (PSSM-TPC) are applied to extract multiple features from the proteins. Subsequently, an autoencoder (AE) is used to eliminate redundancy in the features, and SMOTE-ENN is employed to balance the samples by reducing the difference between the numbers of positive and negative cases. Finally, a stacked ensemble classifier composed of bidirectional long short-term memory (BiLSTM), gated recurrent unit (GRU), and support vector machine (SVM) is used for prediction. On the training dataset RBP9873, DEEPStack-RBP reaches an accuracy (ACC) of 98.76% with a Matthews correlation coefficient (MCC) of 0.9508. On the three independent test datasets of Human, S. cerevisiae, and A. thaliana, the accuracy of the model is 97.16%, 97.67%, and 99.57%, respectively, and the MCC is 0.9405, 0.9499, and 0.9906, respectively. These results show that DEEPStack-RBP can be used as a powerful tool for RBP prediction.
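The abstract reports performance as ACC and MCC. As a reading aid, the following minimal sketch shows how these two metrics are conventionally computed from the counts of a binary confusion matrix. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper or its code, and the zero-denominator convention for MCC is an assumption.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Assumption: returns 0.0 when the denominator is zero, a common
    convention when one row or column of the matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    tp, tn, fp, fn = 90, 95, 5, 10
    print(f"ACC = {accuracy(tp, tn, fp, fn):.4f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside ACC when the positive and negative classes are imbalanced.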
computer science, artificial intelligence