lncLocPred: Predicting LncRNA Subcellular Localization Using Multiple Sequence Feature Information

Yongxian Fan, Meijun Chen, Qingqi Zhu
DOI: https://doi.org/10.1109/access.2020.3007317
IF: 3.9
2020-01-01
IEEE Access
Abstract: Determining the subcellular localization of long non-coding RNAs (lncRNAs) provides valuable clues to their function. As an alternative to time-consuming and expensive biochemical experiments, we develop lncLocPred, a machine learning predictor based on logistic regression, to predict the subcellular localization of lncRNAs. We extract sequence features including k-mer, triplet, and PseDNC, and systematically apply feature selection through VarianceThreshold, binomial distribution, and F-score to obtain representative features. We observe that the top-ranked k-mers have a higher G and C content, in the form of short repeats. Improving prediction accuracy on several subcellular localizations, our model achieves an overall accuracy of 92.37% on the benchmark dataset under the jackknife test, higher than existing state-of-the-art predictors. lncLocPred also performs better on an independent dataset that we collected. Related experimental data and source code are available at https://github.com/jademyC1221/lncLocPred.
computer science, information systems; telecommunications; engineering, electrical & electronic
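
The abstract names k-mer features and two of the selection steps (a variance threshold and the F-score) without showing how they are computed. The following is a minimal sketch of those ideas in plain Python, not the authors' implementation (their code is in the repository linked above). The function names `kmer_frequencies`, `variance_filter`, and `f_score` are hypothetical, and the F-score shown is the common two-class form, assumed here for illustration; the paper's multi-location setting may use a different variant.

```python
from itertools import product
from statistics import mean, pvariance, variance


def kmer_frequencies(seq, k):
    """Normalized k-mer frequencies over ACGT, in lexicographic k-mer order."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA letters
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def variance_filter(rows, threshold=0.0):
    """Indices of feature columns whose population variance exceeds threshold."""
    columns = zip(*rows)
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def f_score(pos, neg):
    """Two-class F-score of one feature (assumed form, for illustration).

    pos and neg hold the feature's values in each class; each needs at
    least two values because the sample variance is used.
    """
    overall = mean(list(pos) + list(neg))
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)
    return between / within if within else float("inf")
```

In a pipeline of this kind, each sequence would be turned into a feature vector, near-constant columns dropped, and the remaining features ranked by score before a classifier such as logistic regression is fitted on the top-ranked subset.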