Abstract: Background Many medicinal plants have complex genomes with high ploidy, high heterozygosity, and abundant repetitive content, all of which pose severe challenges for genome sequencing. Long reads from Oxford Nanopore sequencing technology (ONT) or Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing offer great advantages in de novo genome assembly, especially for complex genomes with high heterozygosity and repetitive content. The genomes of multiple allotetraploid species have now been sequenced with long reads. However, we found that a considerable proportion of these genomes (7.9% on average, maximum 23.7%) was not covered by next-generation sequencing (NGS) reads. These uncovered regions (UCRs) are either questionable, low-quality assembly sequences or genomic regions that NGS cannot sequence because of sequencing bias. Neither the underlying causes of UCRs in genome assemblies nor solutions to the problem have been studied. Methods In this study, we sequenced the tetraploid genome of Veratrum dahuricum (Turcz.) O. Loes (VDL), a Chinese medicinal plant, on the ONT platform and assembled the genome with three strategies in parallel. To explore the cause of the UCRs, we compared the quality, coverage, and heterozygosity of the three ONT assemblies with those of another released assembly of the same individual built from PacBio circular consensus sequencing (CCS) reads. Results By mapping the NGS reads against the three ONT assemblies and the CCS assembly, we found that NGS reads covered 49.15% to 76.31% of the ONT assemblies, much less than the 99.53% of the CCS assembly they covered. Alignment between the ONT assemblies and the CCS assembly showed that most UCRs can be aligned with the CCS assembly. We therefore conclude that the UCRs in the ONT assemblies are low-quality sequences whose error rate is too high for short reads to align, rather than genomic regions that NGS cannot sequence. Further comparison among the intermediate versions of the ONT assemblies showed that the most probable origin of those errors is a combination of initial sequencing errors in the long reads and artificial errors introduced by “self-correction”. We also found that polishing the ONT assembly with CCS reads corrects those errors efficiently. Conclusions By analyzing genome features and read alignments, we found that the high proportion of UCRs in the ONT assembly of VDL is caused by sequencing errors and by additional errors introduced by self-correction. The high error rate of raw ONT reads makes them unsuitable for self-correction prior to allotetraploid genome assembly, because self-correction introduces artificial errors into > 5% of the UCR sequences. For polyploid genomes, we suggest polishing the assembly with high-precision CCS reads to correct those errors effectively.
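The UCR proportion reported above is, in essence, the fraction of assembly positions with no NGS read support. The following is a minimal sketch of that calculation, not the authors' pipeline. It assumes per-base depth is available as tab-separated `contig`, `position`, `depth` lines (the three-column layout written by `samtools depth -a`); the function name `uncovered_regions` and the `min_depth` threshold are hypothetical choices made for illustration.

```python
from typing import Iterable, List, Tuple


def uncovered_regions(
    depth_lines: Iterable[str], min_depth: int = 1
) -> Tuple[float, List[Tuple[str, int, int]]]:
    """Return (uncovered_fraction, regions) from 'contig<TAB>pos<TAB>depth' lines.

    A position counts as uncovered when its depth is below min_depth
    (an assumed threshold; the source does not state one).
    Regions are (contig, start, end) with 1-based inclusive coordinates.
    """
    total = 0
    uncovered = 0
    regions: List[Tuple[str, int, int]] = []
    current = None  # open uncovered run as [contig, start, end]

    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, pos_s, depth_s = line.split("\t")
        pos, depth = int(pos_s), int(depth_s)
        total += 1

        if depth < min_depth:
            uncovered += 1
            # Extend the open run only if this base is adjacent on the same contig.
            if current and current[0] == contig and current[2] == pos - 1:
                current[2] = pos
            else:
                if current:
                    regions.append((current[0], current[1], current[2]))
                current = [contig, pos, pos]
        elif current:
            # A covered base closes the open run.
            regions.append((current[0], current[1], current[2]))
            current = None

    if current:
        regions.append((current[0], current[1], current[2]))

    fraction = uncovered / total if total else 0.0
    return fraction, regions


if __name__ == "__main__":
    demo = ["c1\t1\t5", "c1\t2\t0", "c1\t3\t0", "c1\t4\t7", "c2\t1\t0"]
    frac, regs = uncovered_regions(demo)
    print(f"uncovered fraction: {frac:.2f}")
    print(regs)
```

Reporting the uncovered intervals alongside the fraction makes it possible to extract those sequences and align them to another assembly, which is the kind of check the Results describe for the CCS assembly.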

Comparison of the Two Up-to-date Sequencing Technologies for Genome Assembly: HiFi Reads of Pacific Biosciences Sequel II System and Ultralong Reads of Oxford Nanopore

Pseudo-Sanger Sequencing: Massively Parallel Production of Long and Near Error-Free Reads Using NGS Technology

Benchmarking multi-platform sequencing technologies for human genome assembly

Systematic Comparison of the Performances of De Novo Genome Assemblers for Oxford Nanopore Technology Reads From Piroplasm

Pre-Assembly NGS Correction of ONT Reads Achieves HiFi-Level Assembly Quality

Matching Excellence: ONT’s Rise to Parity with PacBio in Genome Reconstruction of Non-Model Bacterium with High GC Content

Matching excellence: Oxford Nanopore Technologies' rise to parity with Pacific Biosciences in genome reconstruction of non-model bacterium with high G+C content

Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations

Gapless assembly of complete human and plant chromosomes using only nanopore sequencing

Comparison of the Two Major Classes of Assembly Algorithms: Overlap-Layout-consensus and De-Bruijn-graph

Benchmarking of Long-Read Sequencing, Assemblers and Polishers for Yeast Genome

Benchmarking of next and third generation sequencing technologies and their associated algorithms for de novo genome assembly

Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction

Comprehensive assessment of 11 de novo HiFi assemblers on complex eukaryotic genomes and metagenomes

The Hitchhiker’s Guide to Sequencing Data Types and Volumes for Population-Scale Pangenome Construction

AlignGraph2: similar genome-assisted reassembly pipeline for PacBio long reads

A Practical Comparison Of De Novo Genome Assembly Software Tools For Next-Generation Sequencing Technologies

Optimization of the In-Silico Mate-Pair Method Improved Contiguity and Accuracy of Genome Assembly

Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions

Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies