GoldPolish-Target: Targeted long-read genome assembly polishing
Emily Zhang, Lauren Coombe, Johnathan Wong, Rene L Warren, Inanc Birol
DOI: https://doi.org/10.1101/2024.09.27.615516
2024-10-01
Abstract: Advanced long-read sequencing technologies, such as those from Oxford Nanopore Technologies and Pacific Biosciences, are widely used in de novo genome sequencing projects. However, long reads typically have higher error rates than short reads. If left unaddressed, the resulting genome assemblies may exhibit high base error rates that compromise the reliability of downstream analyses. Several specialized error correction tools for genome assemblies have emerged, employing a range of algorithms and strategies to improve base quality. Despite these efforts, many genome assembly workflows still produce regions with elevated error rates, such as gaps filled with unpolished or ambiguous bases. To address this, we introduce GoldPolish-Target, a modular targeted sequence polishing pipeline. Coupled with GoldPolish, a linear-time genome assembly polishing algorithm, GoldPolish-Target isolates and polishes user-specified assembly loci, offering a resource-efficient means of polishing targeted regions of draft genomes.
Experiments using Drosophila melanogaster and Homo sapiens datasets demonstrate that GoldPolish-Target can reduce insertion/deletion (indel) and mismatch errors by up to 49.2% and 53.4%, respectively, achieving base accuracy values above 99.9% (Phred score Q>30). This polishing accuracy is comparable to that of the current state-of-the-art polisher, Medaka, while GoldPolish-Target runs up to 36-fold faster and consumes 94% less memory on average.
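The correspondence between the 99.9% base accuracy and Q30 quoted above follows from the standard Phred definition, Q = -10 log10(1 - accuracy). The short Python sketch below only illustrates that arithmetic; the function names are ours and are not part of GoldPolish-Target.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy in [0, 1) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    # Phred definition: Q = -10 * log10(error probability)
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score back to per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # An accuracy of 99.9% is one error per 1,000 bases, i.e. about Q30.
    print(round(accuracy_to_phred(0.999), 2))
    print(round(phred_to_accuracy(30), 4))
```

Any accuracy strictly above 99.9% therefore maps to a Phred score strictly above 30, which is how "Q>30" should be read here.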
GoldPolish-Target, in contrast to most other polishing tools, offers the ability to target specific regions of a genome assembly for polishing, providing a computationally lightweight and highly scalable solution for base error correction.
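To make the idea of isolating specific regions concrete, the following Python sketch pads a set of target intervals with flanking sequence, merges any that overlap, and extracts the corresponding subsequences. It is a generic illustration under our own assumptions (0-based half-open coordinates, a hypothetical `flank` parameter, hypothetical function names) and does not describe how GoldPolish-Target itself is implemented.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed 0-based, half-open [start, end)


def pad_and_merge(intervals: List[Interval], flank: int, seq_len: int) -> List[Interval]:
    """Extend each interval by `flank` bases on both sides, then merge overlaps."""
    padded = sorted(
        (max(0, start - flank), min(seq_len, end + flank))
        for start, end in intervals
    )
    merged: List[Interval] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent to the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extract_targets(sequence: str, intervals: List[Interval], flank: int = 0):
    """Return (start, end, subsequence) for each padded, merged target region."""
    regions = pad_and_merge(intervals, flank, len(sequence))
    return [(start, end, sequence[start:end]) for start, end in regions]


if __name__ == "__main__":
    contig = "ACGTNNNNACGTACGTNNAC"  # toy contig with two runs of ambiguous bases
    for region in extract_targets(contig, [(4, 8), (16, 18)], flank=2):
        print(region)
```

Working only on such extracted regions, rather than on whole contigs, is the general reason a targeted approach can save time and memory.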
https://github.com/bcgsc/goldpolish
Genomics