Integrating Hi-C links with assembly graphs for chromosome-scale assembly

Jay Ghurye, Arang Rhie, Brian P. Walenz, Anthony Schmitt, Siddarth Selvaraj, Mihai Pop, Adam M. Phillippy, Sergey Koren
DOI: https://doi.org/10.1371/journal.pcbi.1007273
2019-08-21
PLoS Computational Biology
Abstract: Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, few open-source tools are available. Error rates, particularly for inversions and fusions across chromosomes, remain higher than those of alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.

Hi-C technology was originally proposed to study the 3D organization of a genome. Recently, it has also been applied to assemble large eukaryotic genomes into chromosome-scale scaffolds. Despite this, there are few open-source methods to generate these assemblies. Existing methods are also prone to small inversion errors due to noise in the Hi-C data. In this work, we address these challenges and develop a method named SALSA2. SALSA2 uses sequence overlap information from an assembly graph to correct inversion errors and provide accurate chromosome-scale assemblies.
biochemical research methods, mathematical & computational biology