IDDLncLoc: Subcellular Localization of LncRNAs Based on a Framework for Imbalanced Data Distributions
Wang Yan, Zhu Xiaopeng, Yang Lili, Hu Xuemei, He Kai, Yu Cuinan, Jiao Shaoqing, Chen Jiali, Guo Rui, Yang Sen
DOI: https://doi.org/10.1007/s12539-021-00497-6
2022-01-01
Interdisciplinary Sciences Computational Life Sciences
Abstract: Long non-coding RNAs (lncRNAs) play a crucial role in many cellular processes, including RNA splicing, signaling, and protein regulation, and can serve as genetic markers. Because identifying the localization of an lncRNA in the cell through experimental methods is complicated, hard to reproduce, and expensive, we propose a novel method named IDDLncLoc, which adopts an ensemble model to predict subcellular localization. In the proposed model, dinucleotide-based auto-cross covariance features, k-mer nucleotide composition features, and composition, transition, and distribution features are used to encode a raw RNA sequence into a vector. To screen out reliable features, feature selection based on the binomial distribution and recursive feature elimination are employed. Furthermore, mini-batch oversampling, random sampling, and stacking ensemble strategies are customized to overcome the data imbalance in the benchmark dataset. Finally, compared with the latest methods, IDDLncLoc achieves an accuracy of 94.96% on the benchmark dataset, which is 2.59% higher than the best existing method, and the results further demonstrate that IDDLncLoc performs well on the subcellular localization of lncRNAs. In addition, a user-friendly web server is available at http://lncloc.club.
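As an illustration of one of the encodings named in the abstract, the k-mer nucleotide composition of a sequence can be sketched as below. This is a minimal sketch, not the authors' implementation: the function name `kmer_composition`, the default `k=2`, the A/C/G/U alphabet, the conversion of T to U, and the choice to skip windows containing ambiguous bases are all assumptions made for the example.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Return the normalized frequency of every k-mer over `alphabet`.

    Hypothetical helper for illustration only; the paper's exact
    preprocessing and choice of k are not given in the abstract.
    """
    # Assumption: treat DNA-style input as RNA by mapping T to U.
    seq = seq.upper().replace("T", "U")

    # Fixed, lexicographically ordered k-mer vocabulary (4**k entries).
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    total = 0
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1

    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]
```

Calling `kmer_composition("ACGUACGU", k=2)` yields a 16-dimensional vector whose entries sum to 1. Such vectors could be concatenated with the other feature groups before feature selection.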