Pre-Assembly NGS Correction of ONT Reads Achieves HiFi-Level Assembly Quality

Evgeniy Mozheiko, Heng Yi, Anzhi Lu, Heitung Kong, Yong Hou, Yan Zhou, Hui Gao
DOI: https://doi.org/10.1101/2024.07.12.603260
2024-07-13
Abstract: Recently developed hybrid assemblies can achieve Telomere-to-Telomere (T2T) completeness of some chromosomes. However, such approaches require sequencing a large volume of both Pacific Biosciences high-fidelity (HiFi) and Oxford Nanopore Technologies (ONT) reads. At the same time, third-generation sequencing techniques are rapidly advancing, increasing the available read length and accuracy. To reduce the final cost of genome assembly, here we investigated the possibility of assembly from low-coverage samples using only ONT reads corrected by Next-Generation Sequencing (NGS) reads. We demonstrated that ONT-based assembly approaches corrected by NGS can achieve performance metrics comparable to more expensive hybrid approaches based on HiFi sequencing. We investigated the assembly of different chromosomes and the low-coverage performance of state-of-the-art hybrid assembly tools, including Verkko and Hifiasm, as well as ONT-based assemblers such as Shasta and Flye. We rigorously evaluated the performance of MGI, Illumina, and stLFR NGS technologies across various aspects of hybrid genome assembly, including pre-assembly correction, haplotype phasing, and polishing, and found them to be similarly effective. Additionally, we proposed two-round assembly methods that utilize stLFR linked-read data to achieve assembly phasing performance comparable to that obtained with trio data.
Genomics
What problem does this paper attempt to address?
The paper aims to reduce the cost of genome assembly by using low-coverage samples and only Oxford Nanopore Technologies (ONT) reads corrected with Next-Generation Sequencing (NGS) reads, while achieving performance metrics comparable to those of more expensive hybrid assembly methods, such as hybrid assembly based on PacBio high-fidelity (HiFi) sequencing. Specifically, the study explores the following points:

1. **Cost-effectiveness**: Reduce the overall cost of genome assembly by using low-coverage ONT and NGS data instead of high-coverage HiFi and ultra-long ONT data.
2. **Assembly quality**: Evaluate the quality of genome assemblies built from NGS-corrected ONT data, in particular whether they reach metrics such as NG50, QV score, and k-mer-based completeness similar to those of the hybrid HiFi + ONT assembly method.
3. **Low-coverage performance**: Study how assembly tools perform on NGS-corrected ONT data at different coverages (10X, 20X, 30X, 50X), with particular attention to stability and performance at low coverage.
4. **Comparison of different NGS technologies**: Compare three NGS technologies (MGI, Illumina, and stLFR) in genome assembly, in particular their effect on haplotype phasing, ONT read correction, assembly polishing, and QV score evaluation.

Through these studies, the authors aim to show that, even without pursuing complete Telomere-to-Telomere (T2T) assembly, high-quality and cost-effective results can be achieved by combining regular R9 ONT reads with accurate NGS reads. The study also proposes a two-round assembly method that uses stLFR linked-read data to achieve assembly phasing performance comparable to that obtained with trio data.
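Two of the metrics named above, NG50 and the QV score, have simple standard definitions, and a minimal sketch may help make them concrete. The function names `ng50` and `phred_qv` and the toy numbers below are illustrative assumptions, not taken from the paper. In practice QV is usually estimated from k-mer counts by a dedicated tool; here the error count is simply given as an input, and only the Phred-scale conversion is shown.

```python
import math


def ng50(contig_lengths, genome_size):
    """Return NG50 for a list of contig lengths.

    NG50 is the length of the contig at which the cumulative length of
    contigs, taken from longest to shortest, first reaches half of the
    estimated genome size. Returns 0 if the assembly never reaches it.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def phred_qv(errors, total_bases):
    """Convert an error count into a Phred-scaled quality value.

    QV = -10 * log10(error rate); an error-free input maps to infinity.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if errors == 0:
        return float("inf")
    return -10 * math.log10(errors / total_bases)


# Toy example (hypothetical numbers): five contigs, genome size of 200.
toy_contigs = [50, 40, 30, 20, 10]
print(ng50(toy_contigs, 200))   # 30, since 50 + 40 + 30 = 120 >= 100
print(phred_qv(1, 10_000))      # about 40, i.e. one error per 10 kb
```

Unlike N50, which is computed against the total assembly length, NG50 is computed against the estimated genome size, so it stays comparable across assemblies of different total length, such as those built at 10X and at 50X coverage.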