EndHiC: assemble large contigs into chromosomal-level scaffolds using the Hi-C links from contig ends

Sen Wang, Hengchao Wang, Fan Jiang, Anqi Wang, Hangwei Liu, Hanbo Zhao, Boyuan Yang, Dong Xu, Yan Zhang, Wei Fan
DOI: https://doi.org/10.48550/arXiv.2111.15411
2021-11-30
Abstract: Motivation: The application of PacBio HiFi and ultra-long ONT reads has brought huge progress in contig-level assembly, but it is still challenging to assemble large contigs into chromosomes with available Hi-C scaffolding software, all of which compute the contact value between contigs using the Hi-C links from the whole contig regions. Because the Hi-C links between two adjacent contigs concentrate only at the neighboring ends of the contigs, larger contig size reduces the power to differentiate adjacent (signal) from non-adjacent (noise) contig linkages, leading to a higher rate of mis-assembly. Results: We present a software package, EndHiC, which is suitable for assembling large contigs (> 1 Mb) into chromosomal-level scaffolds, using Hi-C links from only the contig end regions instead of the whole contig regions. Benefiting from the increased signal-to-noise ratio, EndHiC achieves much higher scaffolding accuracy than the existing software LACHESIS, ALLHiC, and 3D-DNA. Moreover, EndHiC has few parameters, runs 10-1000 times faster than existing software, needs trivial memory, provides robustness evaluation, and allows graphic viewing of the scaffold results. The high scaffolding accuracy and user-friendly interface of EndHiC liberate users from labor-intensive manual checking and revision. Availability and implementation: EndHiC is written in Perl and is freely available at https://github.com/fanagislab/EndHiC. Contact: fanwei@caas.cn and milrazhang@163.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Genomics
What problem does this paper attempt to address?
This paper addresses how to use Hi-C data to accurately assemble large contigs into chromosome-level scaffolds during genome assembly. Although PacBio HiFi and ultra-long ONT reads have greatly improved contig-level assembly, existing Hi-C scaffolding software still struggles with large contigs. These tools compute contact values from the Hi-C links across the entire contig. Because the links between two adjacent contigs concentrate at their neighboring ends, whole-contig counting becomes worse at distinguishing adjacent (signal) from non-adjacent (noise) contig links as contigs grow larger, which results in a higher mis-assembly rate. To solve this problem, the authors developed a new Hi-C scaffolding tool, EndHiC. EndHiC builds scaffolds using only the Hi-C links in the contig end regions, which increases the signal-to-noise ratio and, as the authors report, gives much higher scaffolding accuracy than LACHESIS, ALLHiC, and 3D-DNA. In addition, EndHiC has few parameters, runs fast, needs little memory, provides robustness evaluation, and supports graphical viewing of the results.
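The core idea, counting only the Hi-C links whose two anchors both fall in contig end regions, can be illustrated with a minimal Python sketch. This is not EndHiC's implementation (EndHiC is written in Perl), and the function names `end_label` and `count_end_links`, the `end_size` parameter, and the `(contig_a, pos_a, contig_b, pos_b)` link format are all assumptions made for illustration.

```python
from collections import defaultdict


def end_label(pos, contig_len, end_size):
    """Return 'head' or 'tail' if pos lies in an end region, else None.

    Hypothetical helper: positions are 0-based, and 'head' wins when a
    contig is shorter than 2 * end_size.
    """
    if pos < end_size:
        return "head"
    if pos >= contig_len - end_size:
        return "tail"
    return None


def count_end_links(links, contig_lengths, end_size):
    """Count inter-contig Hi-C links that connect two contig end regions.

    links: iterable of (contig_a, pos_a, contig_b, pos_b) tuples
           (an assumed format, not EndHiC's input format).
    Returns a dict keyed by an order-independent pair of (contig, end).
    """
    counts = defaultdict(int)
    for ctg_a, pos_a, ctg_b, pos_b in links:
        if ctg_a == ctg_b:
            continue  # links within one contig say nothing about adjacency
        end_a = end_label(pos_a, contig_lengths[ctg_a], end_size)
        end_b = end_label(pos_b, contig_lengths[ctg_b], end_size)
        if end_a is None or end_b is None:
            continue  # at least one anchor is in a contig interior: ignore
        key = tuple(sorted([(ctg_a, end_a), (ctg_b, end_b)]))
        counts[key] += 1
    return dict(counts)


# Toy data: two links join the tail of ctg1 to the head of ctg2,
# one link joins contig interiors, and one is an intra-contig link.
contig_lengths = {"ctg1": 5_000_000, "ctg2": 4_000_000}
links = [
    ("ctg1", 4_950_000, "ctg2", 20_000),
    ("ctg1", 4_990_000, "ctg2", 50_000),
    ("ctg1", 2_500_000, "ctg2", 2_000_000),
    ("ctg1", 100, "ctg1", 200),
]
end_counts = count_end_links(links, contig_lengths, end_size=100_000)
print(end_counts)
```

In this toy input only the two tail-to-head links are counted. The interior link, which a whole-contig contact measure would include, is discarded. A real scaffolder would still need to turn such end-to-end counts into ordered and oriented scaffolds, which this sketch does not attempt.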