Non-coding RNA identification with pseudo RNA sequences and feature representation learning
Xian-Gan Chen, Xiaofei Yang, Chenhong Li, Xianguang Lin, Wen Zhang
DOI: https://doi.org/10.1016/j.compbiomed.2023.107355
IF: 7.7
2023-08-28
Computers in Biology and Medicine
Abstract: Distinguishing non-coding RNAs (ncRNAs) from coding RNAs is an important task in bioinformatics. Although many methods have been proposed for this task, further improving the accuracy of ncRNA identification remains challenging. In this paper, we propose CPPFLPS, a coding potential predictor that uses feature representation learning based on pseudo RNA sequences. Pseudo RNA sequences, generated by simulating RNA sequence mutations, serve as new samples for data augmentation. Six string operations that simulate mutations are considered: base replacement, base insertion, base deletion, subsequence reversion, subsequence repetition and subsequence deletion. In the feature representation learning framework, different types of pseudo RNA sequences are added to the training set to form new training sets, on which baseline classifiers are trained to obtain baseline models. The labels predicted by these baseline models are used as feature vectors to represent RNA sequences, and the feature vectors retained after feature selection are used to train a predictive model that distinguishes ncRNAs from coding RNAs. Our method outperforms existing state-of-the-art methods. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPPFLPS.
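The six string operations named in the abstract can be illustrated with a minimal Python sketch. This is not the CPPFLPS implementation (see the repository for that); the function names, the `max_len` bound on subsequence length, the uniform choice of positions and the helper `make_pseudo_sequences` are all assumptions made for illustration, and the abstract does not specify how positions or lengths are chosen.

```python
import random

BASES = "ACGU"  # assumed RNA alphabet


def _random_span(seq, rng, max_len):
    """Pick a random subsequence [start, end) of length 2..max_len (assumed bounds)."""
    if len(seq) < 3:
        raise ValueError("sequence must have at least 3 bases")
    length = rng.randint(2, min(max_len, len(seq) - 1))
    start = rng.randrange(len(seq) - length + 1)
    return start, start + length


def base_replacement(seq, rng):
    """Replace one base with a different base."""
    i = rng.randrange(len(seq))
    choices = [b for b in BASES if b != seq[i]]
    return seq[:i] + rng.choice(choices) + seq[i + 1:]


def base_insertion(seq, rng):
    """Insert one random base at a random position."""
    i = rng.randrange(len(seq) + 1)
    return seq[:i] + rng.choice(BASES) + seq[i:]


def base_deletion(seq, rng):
    """Delete one base."""
    i = rng.randrange(len(seq))
    return seq[:i] + seq[i + 1:]


def subsequence_reversion(seq, rng, max_len=10):
    """Reverse a random subsequence in place."""
    s, e = _random_span(seq, rng, max_len)
    return seq[:s] + seq[s:e][::-1] + seq[e:]


def subsequence_repetition(seq, rng, max_len=10):
    """Duplicate a random subsequence immediately after itself."""
    s, e = _random_span(seq, rng, max_len)
    return seq[:e] + seq[s:e] + seq[e:]


def subsequence_deletion(seq, rng, max_len=10):
    """Remove a random subsequence."""
    s, e = _random_span(seq, rng, max_len)
    return seq[:s] + seq[e:]


OPERATIONS = {
    "base_replacement": base_replacement,
    "base_insertion": base_insertion,
    "base_deletion": base_deletion,
    "subsequence_reversion": subsequence_reversion,
    "subsequence_repetition": subsequence_repetition,
    "subsequence_deletion": subsequence_deletion,
}


def make_pseudo_sequences(seq, op_name, n, seed=0):
    """Generate n pseudo sequences from seq with one operation type (hypothetical helper)."""
    rng = random.Random(seed)
    op = OPERATIONS[op_name]
    return [op(seq, rng) for _ in range(n)]


if __name__ == "__main__":
    example = "AUGGCUACGUAGCUAGCUAA"  # made-up sequence, for illustration only
    for name in OPERATIONS:
        print(name, make_pseudo_sequences(example, name, n=1)[0])
```

In an augmentation setting, each pseudo sequence would presumably inherit the label (coding or non-coding) of the sequence it was derived from; that labeling rule is also an assumption here, since the abstract does not state it.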
engineering, biomedical; computer science, interdisciplinary applications; mathematical & computational biology; biology