Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm

Haoyu Cheng, Gregory T. Concepcion, Xiaowen Feng, Haowen Zhang, Heng Li
DOI: https://doi.org/10.1038/s41592-020-01056-5
IF: 48
2021-02-01
Nature Methods
Abstract: Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly.
biochemical research methods
What problem does this paper attempt to address?
The paper addresses the problem of generating high-quality haplotype-resolved genome assemblies from high-fidelity (HiFi) long reads. Specifically, it introduces a new assembler named hifiasm, which aims to:
1. **Overcome the limitations of existing assembly methods**: Existing assembly tools either collapse heterozygous alleles into one consensus sequence, discarding the variation carried by the other allele, or produce highly fragmented assemblies when separating haplotypes. These problems are particularly evident in heterozygous samples, and especially in human genomes, where heterozygosity is usually low and is hard to distinguish from sequencing errors when the long reads are noisy.
2. **Improve the assembly quality of heterozygous samples**: By using HiFi reads (long reads with a low error rate), hifiasm can more accurately identify and separate heterozygous alleles during assembly, thereby generating more complete and contiguous haplotype-resolved assemblies.
3. **Optimize the trio binning strategy**: Standard trio binning has limitations when dealing with highly heterozygous regions and may assign reads to the wrong parental haplotype. By combining the phase information of HiFi reads with the structure of the assembly graph, hifiasm can correct such errors and improve the quality of haplotype-resolved assemblies.
4. **Be applicable to multiple genome types**: The paper evaluates hifiasm on human genomes and on nonhuman genomes including mouse, maize, strawberry, frog, and California redwood. The results show that hifiasm produces higher-quality assemblies than other tools in most cases, especially in complex repeat regions and highly heterozygous regions.
In summary, the main objective of this paper is to develop an assembly algorithm that fully exploits HiFi long reads to generate high-quality haplotype-resolved genome assemblies, thereby better resolving sequence variations in the genome.
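To make the trio binning idea concrete, the following is a minimal Python sketch of standard, per-read trio binning: each child read is assigned to a parent according to how many parent-specific k-mers it contains. It is an illustration under simplifying assumptions, not hifiasm's implementation. The function names (`kmers`, `parent_specific_kmers`, `bin_read`), the tiny k-mer size, and the toy sequences are all hypothetical; reverse complements, k-mer count thresholds, and sequencing errors are ignored. The graph-based variant described in the paper additionally uses the assembly graph, which this sketch does not attempt to model.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets.

    Assumption: a k-mer seen in one parent's reads and never in the
    other's is treated as a marker for that parent.
    """
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign one child read by majority of parent-specific k-mers."""
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & pat_only)
    mat_hits = len(read_kmers & mat_only)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    # No markers, or a tie: the read alone cannot be placed.
    return "ambiguous"


# Toy example: the parents differ at a single base.
K = 4
pat_only, mat_only = parent_specific_kmers(["ACGTACGA"], ["ACGTTCGA"], K)

for child_read in ["CGTACG", "GTTCGA", "ACGT"]:
    print(child_read, bin_read(child_read, pat_only, mat_only, K))
```

Reads that carry no parent-specific k-mers, or equal numbers from both parents, stay ambiguous in this per-read scheme. Cases like these are where extra context, such as the structure of the assembly graph, can help place a read.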