Benchmarking of bioinformatics tools for the hybrid de novo assembly of human whole-genome sequencing data

Adrian Munoz-Barrera, Luis A. Rubio-Rodriguez, David Jaspez, Almudena Corrales, Itahisa Marcelino-Rodriguez, Jose M. Lorenzo-Salazar, Rafaela Gonzalez-Montelongo, Carlos Flores
DOI: https://doi.org/10.1101/2024.05.28.595812
2024-05-29
Abstract: Accurate and complete de novo assembled genomes sustain variant identification and catalyze the discovery of new genomic features and biological functions. However, accurate and precise de novo assembly of large and complex genomes remains a challenging task. Long-read sequencing data, alone or in hybrid mode combined with more accurate short-read sequences, facilitate the de novo assembly of genomes. A number of software tools exist for de novo genome assembly from long-read data, although specific performance comparisons for assembling human genomes are lacking. Here we benchmarked 11 different pipelines, including four long-read-only assemblers and three hybrid assemblers, combined with four polishing schemes, for de novo genome assembly of a human reference material sequenced with Oxford Nanopore Technologies and Illumina. In addition, the best-performing choice was validated in a non-reference routine laboratory sample. Software performance was evaluated by assessing the quality of the assemblies with QUAST, BUSCO, and Merqury metrics, and the computational costs associated with each of the pipelines were also assessed. We found that Flye was superior to all other assemblers, especially when relying on Ratatosk error-corrected long reads. Polishing improved the accuracy and continuity of the assemblies, and the combination of two rounds of Racon and Pilon achieved the best results. The assembly of the non-reference sample showed assembly metrics comparable to those of the reference material. Based on the results, a complete optimal analysis pipeline for assembly, polishing, and contig curation, developed on Nextflow, is provided to enable efficient parallelization and built-in dependency management, and to further advance the generation of high-quality, chromosome-level human assemblies.
Bioinformatics