Flnc: Machine Learning Improves the Identification of Novel Full-length Long Noncoding RNAs from RNA Sequencing Data Without Transcriptional Initiation Profiles

Zixiu Li, Peng Zhou, Euijin Kwon, Katherine Fitzgerald, Zhiping Weng, Chan Zhou
DOI: https://doi.org/10.1101/2022.08.02.502545
2022-01-01
Abstract: Long noncoding RNAs (lncRNAs) play critical regulatory roles in human development and disease. However, many lncRNAs have yet to be annotated. The conventional approach to identifying novel lncRNAs from RNA sequencing (RNA-seq) data is to find transcripts without coding potential. This approach has a false discovery rate of 30-75%. The majority of these misidentified lncRNAs are RNA fragments or transcriptional noise and lack defined transcription start sites, which are marked by H3K4me3 histone modifications. Therefore, the accuracy of lncRNA identification can be improved by incorporating H3K4me3 chromatin immunoprecipitation sequencing (ChIP-seq) data. However, because of cost, time, and limited sample availability, most RNA-seq datasets lack matching H3K4me3 ChIP-seq data. This paucity of H3K4me3 data greatly hinders efforts to accurately identify novel lncRNAs. To address this problem, we have developed software, Flnc, to identify both novel and annotated full-length lncRNAs from RNA-seq data without H3K4me3 profiles. Flnc integrates machine learning models built on four types of features: transcript length, promoter signature, multiple exons, and genomic location. Flnc achieves state-of-the-art prediction power with an AUROC score over 0.92. Flnc significantly improves the prediction accuracy from less than 50% using the conventional approach to over 85%. Flnc is available via <https://github.com/CZhouLab/Flnc>.
Competing Interest Statement: The authors have declared no competing interest.
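For readers unfamiliar with the AUROC figure quoted in the abstract, the following is a minimal sketch of how such a score can be computed from classifier outputs. It is not taken from Flnc: the function name `auroc` and the toy labels and scores are assumptions made purely for illustration. AUROC is the probability that a randomly chosen positive example (here, a true full-length lncRNA) receives a higher score than a randomly chosen negative example, with ties counted as one half.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 ground-truth classes (1 = positive).
    scores: iterable of classifier scores, higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not results from the paper).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
print(f"AUROC on toy data: {toy_auroc:.3f}")
```

The pairwise form is quadratic in the number of examples and is chosen here for readability; rank-based implementations in standard libraries give the same value more efficiently.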