Verkko2: Integrating proximity ligation data with long-read De Bruijn graphs for efficient telomere-to-telomere genome assembly, phasing, and scaffolding

Dmitry Antipov,Mikko Rautiainen,Sergey Nurk,Brian P. Walenz,Steven J. Solar,Adam M. Phillippy,Sergey Koren
DOI: https://doi.org/10.1101/2024.12.20.629807
2024-12-22
Abstract:The Telomere-to-Telomere Consortium recently finished the first truly complete sequence of a human genome. To resolve the most complex repeats, this project relied on the semi-manual combination of long, accurate PacBio HiFi and ultra-long Oxford Nanopore sequencing reads. The Verkko assembler later automated this process, achieving complete assemblies for approximately half of the chromosomes in a diploid human genome. However, the first version of Verkko was computationally expensive and could not resolve all regions of a typical human genome. Here we present Verkko2, which implements a more efficient read correction algorithm, improves repeat resolution and gap closing, introduces proximity-ligation-based haplotype phasing and scaffolding, and adds support for multiple long-read data types. These enhancements allow Verkko to assemble all regions of a diploid human genome, including the short arms of the acrocentric chromosomes and both sex chromosomes. Together, these changes increase the number of telomere-to-telomere scaffolds by twofold, reduce runtime by fourfold, and improve assembly correctness. On a panel of 19 human genomes, Verkko2 assembles an average of 39 of 46 complete chromosomes as scaffolds, with 21 of these assembled as gapless contigs. Together, these improvements enable telomere-to-telomere comparative and pangenomics, at scale.
Bioinformatics
What problem does this paper attempt to address?