Puzzle Hi-C: an accurate scaffolding software

Guoliang Lin, Zhiru Huang, Tingsong Yue, Jing Chai, Yan Li, Huimin Yang, Wanting Qin, Guobing Yang, Robert W. Murphy, Ya-ping Zhang, Zijie Zhang, Wei Zhou, Jing Luo
DOI: https://doi.org/10.1101/2024.01.29.577879
2024-01-31
Abstract: High-quality, chromosome-scale genomes are essential for genomic analyses. Analyses, including 3D genomics, epigenetics, and comparative genomics, rely on a high-quality genome assembly, which is often accomplished with the assistance of Hi-C data. Current Hi-C-assisted assembling algorithms either generate ordering and orientation errors or fail to assemble high-quality chromosome-level scaffolds. Here, we offer the software Puzzle Hi-C, which uses Hi-C reads to accurately assign contigs or scaffolds to chromosomes. Puzzle Hi-C uses the triangle region instead of the square region to count interactions in a Hi-C heatmap. This strategy dramatically diminishes scaffolding interference caused by long-range interactions. This software also introduces a dynamic, triangle window strategy during assembly. Initially small, the window expands with interactions to produce more effective clustering. Puzzle Hi-C outperforms available scaffolding tools.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses errors and inefficiencies in genome assembly assisted by Hi-C (high-throughput chromosome conformation capture). When building chromosome-level genomes, current Hi-C-assisted assembly algorithms either introduce ordering and orientation errors or fail to produce high-quality chromosome-level scaffolds. These problems stem mainly from long-range interactions, which interfere with scaffolding and lead to incorrect assemblies.

To solve these problems, the authors developed a new software tool, Puzzle Hi-C. It counts interactions in the Hi-C heatmap over triangular regions instead of square regions, which significantly reduces the interference of long-range interactions on scaffolding. It also introduces a dynamic triangular window strategy: the window starts small and expands as interactions increase during assembly, enabling more effective clustering. With these improvements, Puzzle Hi-C performs better than existing scaffolding tools on both simulated and real data.

In summary, the paper aims to provide a more accurate and efficient Hi-C-assisted assembly method that reduces common assembly errors, improves the quality and continuity of genome assemblies, and supports downstream analyses such as 3D genomics, epigenetics, and comparative genomics.
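
The difference between square and triangular counting can be illustrated with a small sketch. This is not the Puzzle Hi-C implementation. It assumes a binned, symmetric contact matrix in which contig A occupies bins `[0, n_a)` and contig B follows immediately, and it takes "triangle" to mean the cells of the junction block whose two bins lie at most `w` bins apart. The function names, the doubling schedule, and the `min_contacts` threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def _junction_block(m, n_a, w):
    """Return the inter-contig block around the junction and its bin indices."""
    lo = max(0, n_a - w)            # first bin of contig A inside the window
    hi = min(m.shape[0], n_a + w)   # one past the last bin of contig B inside the window
    rows = np.arange(lo, n_a)[:, None]
    cols = np.arange(n_a, hi)[None, :]
    return m[lo:n_a, n_a:hi], rows, cols


def square_count(m, n_a, w):
    """Sum every contact in the square block, including far-apart bin pairs."""
    block, _, _ = _junction_block(m, n_a, w)
    return float(block.sum())


def triangle_count(m, n_a, w):
    """Sum only contacts between bins at most w bins apart (assumed triangle region)."""
    block, rows, cols = _junction_block(m, n_a, w)
    mask = (cols - rows) <= w       # keeps the corner nearest the diagonal
    return float(block[mask].sum())


def dynamic_triangle_count(m, n_a, w_start=1, w_max=8, min_contacts=10.0):
    """Grow the window until enough contacts are seen (hypothetical schedule)."""
    w = w_start
    count = triangle_count(m, n_a, w)
    while count < min_contacts and w < w_max:
        w = min(2 * w, w_max)       # assumption: double the window each step
        count = triangle_count(m, n_a, w)
    return w, count


# Toy matrix: uniform background plus one strong long-range contact.
m = np.ones((6, 6))
m[0, 5] = m[5, 0] = 100.0
print(square_count(m, 3, 3))    # long-range spike is included
print(triangle_count(m, 3, 3))  # long-range spike is excluded
```

In this toy case the square count is dominated by the single long-range contact at bins (0, 5), while the triangular count ignores it, which is the effect the summary attributes to the triangle strategy. How Puzzle Hi-C actually defines its regions and expands its window should be checked against the paper and source code.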