KD-KLNMF: Identification of lncRNAs subcellular localization with multiple features and nonnegative matrix factorization

Shengli Zhang,Huijuan Qiao
DOI: https://doi.org/10.1016/j.ab.2020.113995
IF: 2.9
2020-12-01
Analytical Biochemistry
Abstract:<p>Long non-coding RNAs (lncRNAs) are functional RNA molecules longer than 200 nucleotides that have little or no protein-coding capacity. In recent years, a growing number of studies have shown that the subcellular localization of lncRNAs offers valuable clues to their biological functions, so identifying that localization is important. In this paper, a novel statistical model named KD-KLNMF is constructed to predict lncRNA subcellular localization. First, <em>k</em>-mer and dinucleotide-based spatial autocorrelation features are combined into the feature vector. Then, the Synthetic Minority Over-sampling Technique is used to handle the imbalanced dataset. Next, Kullback-Leibler divergence-based nonnegative matrix factorization is applied to select optimal features. A support vector machine is then adopted as the classifier after comparison with other classifiers. Finally, the jackknife test is performed to evaluate the model. The overall accuracies reach 97.24% and 92.86% on the training dataset and the independent dataset, respectively. These results are better than those of previous methods, which indicates that the model is a useful and feasible tool for identifying lncRNA subcellular localization. The datasets and source code are freely available at <a href="https://github.com/HuijuanQiao/KD-KLNMF">https://github.com/HuijuanQiao/KD-KLNMF</a>.</p>
biochemistry & molecular biology,biochemical research methods,chemistry, analytical
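The <em>k</em>-mer step named in the abstract can be illustrated with a minimal sketch. This is not the authors' code (their implementation is in the linked repository); the function name `kmer_frequencies`, the `ACGU` alphabet ordering, the T-to-U normalization, and the choice to skip windows containing ambiguous bases are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return the normalized k-mer frequency vector of an RNA sequence.

    Hypothetical helper: the vector has len(alphabet)**k entries, ordered
    lexicographically by the given alphabet. Windows containing characters
    outside the alphabet (e.g. N) are ignored (an assumption of this sketch).
    """
    # Normalize case and treat DNA-style T as U (assumption).
    seq = seq.upper().replace("T", "U")
    # Enumerate every possible k-mer in a fixed order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count all overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made entirely of alphabet characters contribute.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]
```

For example, `kmer_frequencies(seq, k=3)` yields a 64-dimensional vector. In a pipeline like the one described, such vectors would be concatenated with the autocorrelation features before over-sampling and factorization.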