ItLnc-BXE: A Bagging-XGBoost-Ensemble Method With Comprehensive Sequence Features for Identification of Plant lncRNAs

Guangyan Zhang, Ziru Liu, Jichen Dai, Zilan Yu, Shuai Liu, Wen Zhang
DOI: https://doi.org/10.1109/ACCESS.2020.2985114
IF: 3.9
2020-01-01
IEEE Access
Abstract: Since long non-coding RNAs (lncRNAs) are involved in a wide range of functions in cellular and developmental processes, an increasing number of methods have been proposed for distinguishing lncRNAs from coding RNAs. However, most existing methods are designed for lncRNAs in animal systems, and only a few methods focus on plant lncRNA identification. Plant lncRNAs have characteristics distinct from those of lncRNAs in animal systems, so a computational method for accurate and robust identification of plant lncRNAs is desirable. Herein, we present ItLnc-BXE, a plant lncRNA identification method that combines comprehensive sequence features with an ensemble learning strategy. First, a diverse set of sequence features is collected and filtered by feature selection to represent transcripts. Then, several base learners are trained and combined into a single meta-learner by ensemble learning, yielding an ItLnc-BXE model. ItLnc-BXE models are evaluated on datasets of six plant species. The results show that ItLnc-BXE outperforms other state-of-the-art plant lncRNA identification methods, achieving better and more robust performance (AUC > 95.91%). We also perform experiments on cross-species lncRNA identification, and the results indicate that dicot-based and monocot-based models can accurately identify lncRNAs in lower plant species, such as mosses and algae. In addition, source code and supplementary data are available at https://github.com/BioMedicalBigDataMiningLab/ItLnc-BXE.
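The two steps named in the abstract (representing transcripts by sequence features, then combining several base learners trained on resampled data) can be illustrated with a minimal sketch. It is not the authors' implementation: normalized k-mer frequencies are assumed here as one example of a sequence feature, a one-feature decision stump stands in for the XGBoost base learners, vote averaging stands in for the meta-learner, and feature selection is omitted. All function names, toy sequences, and labels are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def kmer_features(seq, k=2):
    """Normalized k-mer frequencies (assumed example of a sequence feature)."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA letters alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)  # avoid division by zero
    return [counts[m] / total for m in kmers]


def fit_stump(X, y):
    """Decision stump (stand-in for an XGBoost base learner).

    Exhaustively picks the (feature, threshold, polarity) with the best
    training accuracy.
    """
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for pol in (1, -1):
                preds = [1 if pol * (row[j] - t) >= 0 else 0 for row in X]
                acc = sum(p == yi for p, yi in zip(preds, y))
                if best is None or acc > best[0]:
                    best = (acc, j, t, pol)
    return best[1:]


def stump_predict(stump, row):
    j, t, pol = stump
    return 1 if pol * (row[j] - t) >= 0 else 0


def fit_bagging(X, y, n_learners=11, seed=0):
    """Train each base learner on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        learners.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return learners


def bagging_predict_proba(learners, row):
    """Fraction of base learners voting for class 1 (assumed combiner)."""
    return sum(stump_predict(s, row) for s in learners) / len(learners)


# Toy usage with made-up sequences and labels (1 = lncRNA, 0 = coding).
toy_seqs = ["GCGCGGCC", "GGCCGCGC", "ATATTAAT", "TTAATATA"]
toy_labels = [1, 1, 0, 0]
X = [kmer_features(s) for s in toy_seqs]
model = fit_bagging(X, toy_labels)
score = bagging_predict_proba(model, kmer_features("GCGGCGCC"))
```

The returned score is a vote fraction between 0 and 1, so it can be thresholded for classification or ranked across transcripts to compute a metric such as the AUC reported in the abstract.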