Evaluation of deep-learning-based lncRNA identification tools

Cheng Yang, Man Zhou, Haoling Xie, Huaiqiu Zhu
DOI: https://doi.org/10.1101/683425
2019-01-01
Abstract: Long non-coding RNAs (lncRNAs, longer than 200 nt) play crucial biological roles and have been implicated in cancers. A major issue in characterizing newly discovered transcripts is distinguishing lncRNAs from mRNAs. Because experimental methods are time-consuming and costly, computational methods are preferred for large-scale lncRNA identification. In a recent study, Amin et al. evaluated three deep-learning-based lncRNA identification tools (i.e., lncRNAnet, LncADeep, and lncFinder) and concluded “The LncADeep PR (precision recall) curve is just above the no-skill model and LncADeep showed poor overall performance”. This surprising conclusion stems from the authors’ use of a non-default setting of LncADeep. LncADeep has two models: one for full-length transcripts, and one for transcript sets that include partial-length transcripts. Because assembling full-length transcripts from RNA-seq data is difficult, LncADeep’s default is the model for transcript sets that include partial-length transcripts. However, according to the results posted on Amin et al.’s website, the authors ran LncADeep with the full-length model, although they claimed to use its default setting, to identify lncRNAs in the GENCODE dataset, which is composed of both full- and partial-length transcripts. Their evaluation therefore underestimated the performance of LncADeep. In this correspondence, we tested LncADeep’s default setting (i.e., the model for transcript sets that include partial-length transcripts) on the datasets used by Amin et al., and LncADeep achieved the best overall performance compared with the results of the other tools reported by Amin et al.