Benchmarking of Hi-C tools for scaffolding de novo genome assemblies

Lia Obinu, Urmi Trivedi, Andrea Porceddu
DOI: https://doi.org/10.1101/2023.05.16.540917
2024-02-15
Abstract: Incorporating Hi-C reads into genome assembly makes it possible to order large regions of the genome into scaffolds, yielding chromosome-level assemblies. Several bioinformatics tools have been developed for genome scaffolding with Hi-C, and each has pros and cons that need to be carefully evaluated before adoption. We developed assemblyQC, a bash pipeline that combines QUAST, BUSCO, Merqury and, optionally, Liftoff, plus a gene positioning validation script, to evaluate and benchmark the performance of three scaffolders, 3d-dna, SALSA2, and YaHS, on two de novo assemblies of Arabidopsis thaliana obtained from the same raw PacBio HiFi and ONT data. In our analysis, YaHS proved to be the best-performing bioinformatic tool for scaffolding genome assemblies.
Genomics
What problem does this paper attempt to address?
This paper evaluates and compares the performance of different Hi-C tools for scaffolding de novo genome assemblies, in order to determine which tool is best suited to building chromosome-level assemblies. Specifically, the authors selected three commonly used Hi-C scaffolding tools (3d-dna, SALSA2 and YaHS) and benchmarked them on two Arabidopsis thaliana de novo assemblies built from the same raw PacBio HiFi and ONT data.

### Main problems:

1. **Improve the quality of genome assembly**: Improve the continuity and accuracy of de novo assemblies by introducing Hi-C data, in particular by ordering genome fragments into chromosome-level scaffolds.
2. **Select the best tool**: Provide guidance for future genome projects by evaluating the performance of the different tools in detail and selecting the most appropriate Hi-C scaffolding tool.
3. **Verify the effectiveness of tools**: Evaluate how these tools perform in practice using several evaluation tools and analyses (such as QUAST, BUSCO, Merqury and gene collinearity analysis).

### Specific research objectives:

- Develop and use the `assemblyQC` pipeline, which combines QUAST, BUSCO, Merqury and, optionally, Liftoff, to evaluate the assemblies generated by the different scaffolders.
- Compare the performance of the three scaffolders (3d-dna, SALSA2 and YaHS) on the two Arabidopsis thaliana de novo assemblies, using key indicators such as fragmentation, N50, N90, L50 and L90.
- Verify the accuracy and consistency of the assembly results through gene collinearity analysis and Hi-C contact maps.

### Conclusion:

According to the experimental results, YaHS performs best in most evaluation metrics, especially in reducing fragmentation, increasing N50 and N90 values, and maintaining high genome integrity and accuracy.
Therefore, the paper recommends YaHS as the currently optimal Hi-C scaffolding tool.

### Formula examples:

- **N50**: \[ N50 = \text{length of the shortest contig such that the sum of all contigs of this length or longer is at least 50\% of the total assembly size} \]
- **L50**: \[ L50 = \text{minimum number of contigs, taken from longest to shortest, whose combined length makes up at least 50\% of the total assembly size} \]

These evaluations provide support and guidance for future genome assembly projects.
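
As a rough illustration of the N50 and L50 definitions above, the following Python sketch computes Nx and Lx for a list of sequence lengths. The function name `nx_lx` and the toy lengths are hypothetical and are not taken from the paper or from QUAST. In practice these values would be read from a QUAST report rather than recomputed by hand.

```python
from typing import Iterable, Tuple


def nx_lx(lengths: Iterable[int], percent: int = 50) -> Tuple[int, int]:
    """Return (Nx, Lx) for a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest until their combined
    length reaches `percent` of the total assembly size. Nx is the length
    of the last sequence taken and Lx is how many sequences were taken.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("at least one positive sequence length is required")

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise RuntimeError("threshold not reached")  # not expected for valid input


# Hypothetical toy assembly: seven contigs, 300 units in total.
contigs = [80, 70, 50, 40, 30, 20, 10]
# The same sequence after a hypothetical scaffolding step joined some contigs.
scaffolds = [150, 100, 50]

n50, l50 = nx_lx(contigs, 50)    # (70, 2): 80 + 70 reaches half of 300
n90, l90 = nx_lx(contigs, 90)    # (30, 5): the five longest reach 270
print("contigs:   N50 =", n50, "L50 =", l50, "N90 =", n90, "L90 =", l90)
print("scaffolds: N50, L50 =", nx_lx(scaffolds, 50))
```

In this toy case, joining contigs into longer scaffolds raises N50 and lowers L50 while the total size stays the same, which is the kind of contiguity gain the benchmark compares across scaffolders.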