Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat

Yichen Li, Enlian Chen, Jialu Xu, Wenhai Zhang, Xiangxiang Zeng, Yuansheng Liu, Xiao Luo
DOI: https://doi.org/10.1101/2024.05.09.593079
2024-05-10
Abstract: Error self-correction is a pivotal first step in the analysis of long-read sequencing data. However, most existing methods for this purpose are primarily tailored for noisy sequencing data with error rates exceeding 5%, often collapsing true variants in repeats and haplotypes. Alternatively, some methods are heavily optimized for PacBio HiFi reads, leaving a gap in methods specifically designed for Nanopore R10 reads basecalled with high accuracy or super accuracy models, which typically have error rates below 2%. Here, we introduce DeChat, a novel approach specifically designed for Nanopore R10 reads. DeChat enables repeat- and haplotype-aware error correction, leveraging the strengths of both de Bruijn graphs and variant-aware multiple sequence alignment to create a synergistic approach. This approach avoids read overcorrection, ensuring that variants in repeats and haplotypes are preserved while sequencing errors are accurately corrected. Benchmarking experiments reveal that reads corrected using DeChat exhibit substantially fewer errors, ranging from several times to two orders of magnitude lower, compared to the current state-of-the-art approaches. Furthermore, the application of DeChat for error correction significantly improves genome assembly across various aspects. DeChat is implemented as a highly efficient, standalone, and user-friendly software and is publicly available at https://github.com/LuoGroup2023/DeChat.
Bioinformatics