DEGAP: Dynamic elongation of a genome assembly path

Yicheng Huang, Ziyuan Wang, Monica A Schmidt, Handong Su, Lizhong Xiong, Jianwei Zhang
DOI: https://doi.org/10.1093/bib/bbae194
IF: 9.5
2024-04-27
Briefings in Bioinformatics
Abstract: Genome assembly remains a major task in genomic research. Despite the development over the past decades of different assembly software programs and algorithms, it is still a great challenge to assemble a complete genome without any gaps. With the latest DNA circular consensus sequencing (CCS) technology, several assembly programs can now build a genome from raw sequencing data to contigs; however, some complex sequence regions remain as unresolved gaps. Here, we present a novel gap-filling software, DEGAP (Dynamic Elongation of a Genome Assembly Path), that resolves gap regions by utilizing the dual advantages of accuracy and length of high-fidelity (HiFi) reads. DEGAP identifies differences between reads and provides 'GapFiller' or 'CtgLinker' modes to eliminate or shorten gaps in genomes. DEGAP adopts an iterative elongation strategy that automatically and dynamically adjusts parameters according to three complexity factors affecting the genome to determine the optimal extension path. DEGAP has already been successfully applied to decipher complex genomic regions in several projects and may be widely employed to generate more gap-free genomes.
biochemical research methods,mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses the problem of residual gaps in genome assemblies. Despite significant progress in assembly software and algorithms, assembling a complete genome without any gaps remains challenging. This is particularly true for complex sequence regions, such as highly repetitive DNA, where existing assembly tools often leave gaps unresolved.

To address this, the authors developed a gap-filling tool called DEGAP (Dynamic Elongation of a Genome Assembly Path). DEGAP exploits both the accuracy and the length of high-fidelity (HiFi) sequencing reads: it selects appropriate reads and gradually extends the sequence at a gap edge until the gap is filled or shortened. Its iterative elongation strategy automatically and dynamically adjusts parameters to handle genomic regions of varying complexity and to determine the optimal elongation path.

DEGAP provides two operational modes:

1. **GapFiller**: Used for chromosome-level genome assemblies, primarily to fill known gaps.
2. **CtgLinker**: Used for non-chromosome-level genome assemblies, where the gaps are unknown.

With these two modes, DEGAP can reduce the number of gaps across the genome without generating additional sequencing data, improving the continuity and completeness of the assembly. The paper validates DEGAP with two case studies: filling rice centromere sequences and improving human genome sequences. These results indicate that DEGAP performs well at filling gaps in complex regions, producing more complete and contiguous assemblies.
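To make the idea of iterative elongation from a gap edge more concrete, the following is a minimal toy sketch in Python. It is not DEGAP's implementation or interface: the function names (`suffix_prefix_overlap`, `extend_once`, `fill_gap`), the parameters (`start_overlap`, `floor_overlap`, `anchor_len`, `max_rounds`), the exact-match overlap criterion, and the rule of relaxing the overlap threshold when no read extends the path are all assumptions made for illustration. The summary above does not specify which parameters DEGAP adjusts or how.

```python
def suffix_prefix_overlap(path, read, min_overlap):
    """Length of the longest suffix of `path` that equals a prefix of `read`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    max_k = min(len(path), len(read))
    for k in range(max_k, min_overlap - 1, -1):
        if path.endswith(read[:k]):
            return k
    return 0


def extend_once(path, reads, min_overlap):
    """Return the new bases offered by the read with the longest overlap.

    Reads that overlap the path but add no new bases are skipped.
    Returns an empty string if no read extends the path.
    """
    best_k, best_ext = 0, ""
    for read in reads:
        k = suffix_prefix_overlap(path, read, min_overlap)
        if k > best_k and len(read) > k:
            best_k, best_ext = k, read[k:]
    return best_ext


def fill_gap(left_flank, right_flank, reads,
             start_overlap=8, floor_overlap=4, anchor_len=6, max_rounds=100):
    """Elongate `left_flank` with reads until the start of `right_flank` appears.

    Hypothetical adjustment rule: when no read extends the path, lower the
    required overlap by one base (down to `floor_overlap`); after a successful
    extension, reset it to `start_overlap`.

    Returns (gap_sequence, closed). If the gap cannot be closed, gap_sequence
    holds the bases gained so far, i.e. the gap was only shortened.
    """
    anchor = right_flank[:anchor_len]
    path = left_flank
    min_overlap = start_overlap
    for _ in range(max_rounds):
        # The gap is closed once the right flank's start shows up after the left flank.
        idx = path.find(anchor, len(left_flank))
        if idx != -1:
            return path[len(left_flank):idx], True
        extension = extend_once(path, reads, min_overlap)
        if extension:
            path += extension
            min_overlap = start_overlap
        elif min_overlap > floor_overlap:
            min_overlap -= 1
        else:
            break
    return path[len(left_flank):], False


if __name__ == "__main__":
    # Toy data: error-free "reads" sliced from a made-up 33-base sequence.
    left, gap, right = "ACGTACGGTC", "TTGACCATGGAAC", "GCTAGCTTAA"
    genome = left + gap + right
    reads = [genome[4:18], genome[12:26], genome[20:33]]
    print(fill_gap(left, right, reads))
```

In this toy setting, a run that reaches the right flank corresponds to eliminating a gap, and a run that stalls partway corresponds to shortening it. Real HiFi reads would require alignment that tolerates sequencing errors, handling of both strands, and safeguards against repeats, none of which this sketch attempts.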