Clair3-RNA: A deep learning-based small variant caller for long-read RNA sequencing data

Zhenxian Zheng, Xian Yu, Lei Chen, Yan-Lam Lee, Cheng Xin, Angel On Ki Wong, Miten Jain, Rupesh K Kesharwani, Fritz J Sedlazeck, Ruibang Luo
DOI: https://doi.org/10.1101/2024.11.17.624050
2024-11-18
Abstract: Variant calling using long-read RNA sequencing (lrRNA-seq) can be applied to diverse tasks, such as capturing full-length isoforms and gene expression profiling. It poses challenges, however, due to higher error rates than DNA data, the complexity of transcript diversity, and RNA editing events, among other factors. In this paper, we propose Clair3-RNA, the first deep learning-based variant caller tailored for lrRNA-seq data. Clair3-RNA leverages the strengths of the Clair series pipelines and incorporates several techniques optimized for lrRNA-seq data, such as uneven coverage normalization, refinement of training materials, editing site discovery, and the incorporation of phased haplotype information to enhance variant-calling performance. Clair3-RNA is available for various platforms, including PacBio and ONT complementary DNA sequencing (cDNA), and ONT direct RNA sequencing (dRNA). Our results demonstrated that Clair3-RNA achieved a ~91% SNP F1-score on the ONT platform using the latest ONT SQK-RNA004 kit (dRNA004) and a ~92% SNP F1-score in PacBio Iso-Seq and MAS-Seq for variants supported by at least four reads. With at least ten supporting reads and zygosity disregarded, the F1-score reached ~95% for ONT and ~96% for PacBio. With reads phased, the performance reached ~97% for ONT and ~98% for PacBio. Extensive evaluation of various GIAB samples demonstrated that Clair3-RNA consistently outperformed existing callers and is capable of distinguishing ~67% and ~93% of high-quality RNA editing sites on ONT dRNA004 and PacBio Iso-Seq datasets, respectively. Clair3-RNA is open-source and available at https://github.com/HKU-BAL/Clair3-RNA.
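The F1-scores above are reported at a minimum read-support threshold (at least four or at least ten supporting reads), in one case disregarding zygosity. As a rough illustration of how such a figure can be computed, the following minimal sketch filters calls by supporting-read count and compares them with a truth set by site and allele only, which ignores genotype. It is not the paper's evaluation pipeline: the `Call` record, the `snp_f1` function, and the assumption that the truth set has already been restricted to sufficiently covered sites are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical minimal record for one called SNP (not a Clair3-RNA type)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_reads: int  # number of reads supporting the alternate allele


def snp_f1(calls, truth, min_alt_reads=4):
    """Return (precision, recall, F1) for calls with >= min_alt_reads support.

    `truth` is a set of (chrom, pos, ref, alt) tuples, assumed here to be
    already limited to sites with enough coverage to be callable.
    Matching on site and allele only means zygosity is disregarded.
    """
    kept = {
        (c.chrom, c.pos, c.ref, c.alt)
        for c in calls
        if c.alt_reads >= min_alt_reads
    }
    tp = len(kept & truth)   # called and in the truth set
    fp = len(kept - truth)   # called but not in the truth set
    fn = len(truth - kept)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    calls = [
        Call("chr1", 100, "A", "G", 5),
        Call("chr1", 200, "C", "T", 2),   # dropped at the 4-read threshold
        Call("chr1", 300, "G", "A", 10),  # not in the truth set
    ]
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
    print(snp_f1(calls, truth, min_alt_reads=4))  # (0.5, 0.5, 0.5)
```

Raising `min_alt_reads` trades recall on weakly supported sites for precision, so in a real evaluation the truth set would need to be restricted by the same coverage criterion before scores at different thresholds are compared.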
Bioinformatics