Abstract: Genome assembly is a computational technique that pieces together deoxyribonucleic acid (DNA) fragments generated by sequencing technologies to create a comprehensive and precise representation of the entire genome. Generating a high-quality human reference genome is a crucial prerequisite for understanding human biology, and it is also vital for downstream analysis of genomic variation. Over the past few decades, many efforts have been made to create a complete and gapless human reference genome using a diverse range of advanced sequencing technologies. Several tools are available for enhancing the quality of haploid and diploid human genome assemblies, covering contig assembly, polishing of contig errors, scaffolding and variant phasing. Selecting the appropriate tools and technologies remains a daunting task, although several studies have investigated the pros and cons of different assembly strategies. The goal of this paper was to benchmark various strategies for human genome assembly by combining sequencing technologies and tools on two publicly available samples (NA12878 and NA24385) from Genome in a Bottle. We then compared their performance in terms of continuity, accuracy, completeness, variant calling and phasing. We observed that PacBio HiFi long-reads are the optimal choice for generating an assembly with low base errors. On the other hand, we were able to produce the most continuous contigs with Oxford Nanopore long-reads, but they may require further polishing to improve quality. To improve the base accuracy of contigs assembled from Oxford Nanopore long-reads, we recommend polishing with short-reads rather than with the long-reads themselves. Hi-C is the best choice for chromosome-level scaffolding because it captures longer-range DNA connectedness than 10× linked-reads and Bionano optical maps. However, a combination of multiple technologies can be used to further improve the quality and completeness of a genome assembly.
For human diploid genome assembly, hifiasm is the best tool when PacBio HiFi and Hi-C data are available. Looking to the future, we expect that further advancements in human diploid assemblers will leverage the power of PacBio HiFi reads and other technologies with long-range DNA connectedness to enable the generation of high-quality, chromosome-level and haplotype-resolved human genome assemblies.
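As an illustration of the continuity criterion mentioned above, contig N50 is a widely used summary statistic: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly size. The abstract does not state which continuity metric was used, so treating N50 as the measure is an assumption here. The function name `n50` and the toy contig lengths are hypothetical. A minimal sketch:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    lengths = list(lengths)
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    return 0


# Toy contig lengths (hypothetical values, not results from the paper).
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))
```

With these toy lengths the total is 290, and the two longest contigs (80 + 70 = 150) already cover half of it, so the function returns 70. A more continuous assembly of the same genome would concentrate more bases in fewer, longer contigs and yield a higher N50.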


Benchmarking multi-platform sequencing technologies for human genome assembly
