EndHiC: assemble large contigs into chromosomal-level scaffolds using the Hi-C links from contig ends
Sen Wang, Hengchao Wang, Fan Jiang, Anqi Wang, Hangwei Liu, Hanbo Zhao, Boyuan Yang, Dong Xu, Yan Zhang, Wei Fan
DOI: https://doi.org/10.48550/arXiv.2111.15411
2021-11-30
Abstract: Motivation: The application of PacBio HiFi and ultra-long ONT reads has achieved huge progress in contig-level assembly, but it is still challenging to assemble large contigs into chromosomes with available Hi-C scaffolding software, all of which compute the contact value between contigs using Hi-C links from the whole contig regions. Because the Hi-C links between two adjacent contigs concentrate only at their neighboring ends, a larger contig size reduces the power to differentiate adjacent (signal) from non-adjacent (noise) contig linkages, leading to a higher rate of mis-assembly.
Results: We present a software package, EndHiC, which is suited to assembling large contigs (> 1 Mb) into chromosomal-level scaffolds using Hi-C links from only the contig end regions instead of the whole contig regions. Benefiting from the increased signal-to-noise ratio, EndHiC achieves much higher scaffolding accuracy than the existing software LACHESIS, ALLHiC, and 3D-DNA. Moreover, EndHiC has few parameters, runs 10-1000 times faster than existing software, needs trivial memory, provides robustness evaluation, and allows graphic viewing of the scaffold results. The high scaffolding accuracy and user-friendly interface of EndHiC liberate users from labor-intensive manual checks and revisions.
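The core idea, counting only the Hi-C links that fall in contig end regions, can be illustrated with a minimal Python sketch. This is not EndHiC's Perl implementation: the function names, the link tuple layout `(contig_a, pos_a, contig_b, pos_b)`, the `end_size` parameter, and the example data are all assumptions made for illustration, and contigs are assumed to be longer than twice `end_size`.

```python
from collections import Counter

def end_of(length, pos, end_size):
    """Return 'head' or 'tail' if pos lies in an end region, else None.

    Assumes length > 2 * end_size, so the two end regions do not overlap.
    """
    if pos < end_size:
        return "head"
    if pos >= length - end_size:
        return "tail"
    return None

def count_end_links(lengths, links, end_size):
    """Count inter-contig Hi-C links whose both anchors lie in end regions."""
    counts = Counter()
    for contig_a, pos_a, contig_b, pos_b in links:
        if contig_a == contig_b:
            continue  # links within one contig carry no scaffolding signal
        end_a = end_of(lengths[contig_a], pos_a, end_size)
        end_b = end_of(lengths[contig_b], pos_b, end_size)
        if end_a is None or end_b is None:
            continue  # discard links anchored in the contig interior
        # Sort the pair so (A, B) and (B, A) share one key.
        key = tuple(sorted([(contig_a, end_a), (contig_b, end_b)]))
        counts[key] += 1
    return counts

# Hypothetical toy data: two contigs and four Hi-C links (0-based positions).
lengths = {"ctgA": 5_000_000, "ctgB": 4_000_000}
links = [
    ("ctgA", 4_900_000, "ctgB", 50_000),     # tail of ctgA to head of ctgB
    ("ctgB", 120_000, "ctgA", 4_950_000),    # same end pair, reversed order
    ("ctgA", 2_500_000, "ctgB", 2_000_000),  # interior link, ignored
    ("ctgA", 100, "ctgA", 4_999_000),        # intra-contig link, ignored
]
counts = count_end_links(lengths, links, end_size=1_000_000)
```

In this toy input only the two links joining the tail of `ctgA` to the head of `ctgB` are counted; the interior link, which a whole-contig contact value would include, is dropped.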
Availability and implementation: EndHiC is written in Perl, and is freely available at https://github.com/fanagislab/EndHiC. Contact: fanwei@caas.cn and milrazhang@163.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Genomics