Genome sequence assembly evaluation using long-range sequencing data

Dengfeng Guan, Shane A. McCarthy, Jonathan M. D. Wood, Ying Sims, William Chow, Zemin Ning, Kerstin Howe, Guohua Wang, Yadong Wang, Richard Durbin
DOI: https://doi.org/10.1101/2022.05.10.491304
2022-01-01
Abstract: Genome sequences are computationally assembled from millions of much shorter sequencing reads. Although this process can be impressively accurate with long reads, it is still subject to several types of error, including large structural misassemblies in addition to localised base pair substitutions. Recent advances in long single-molecule sequencing, in combination with other long-range technologies such as synthetic long read clouds and Hi-C, have dramatically increased the contiguity of assemblies. This makes it all the more important to be able to validate the structural integrity of the chromosome-scale assemblies now being generated. Here we describe a novel assembly evaluation tool, Asset, which evaluates the consistency of a proposed genome assembly with multiple primary long-range data sets, identifying both supported regions and putative structural misassemblies. We present tests on three de novo assemblies from a human, a goat and a fish species, demonstrating that Asset can identify structural misassemblies accurately by combining regionally supported evidence from long read and other raw sequencing data. Asset can be used to assess overall assembly confidence and to discover specific problematic regions for downstream genome curation, a process that leads to improvement in genome quality; it can also provide feedback to automated assembly pipelines.
Competing Interest Statement: R.D. is a consultant for Dovetail Inc.