BAUM: A DNA Assembler by Adaptive Unique Mapping and Local Overlap-Layout-Consensus

Anqi Wang,Zheng Li,Zhanyu Wang,Lei M. Li
DOI: https://doi.org/10.48550/arXiv.1609.03073
2016-09-11
Abstract: Genome assembly from the high-throughput sequencing (HTS) reads is a fundamental yet challenging computational problem. An intrinsic challenge is the uncertainty caused by the widespread repetitive elements. Here we get around the uncertainty using the notion of uniquely mapped (UM) reads, which motivated the design of a new assembler BAUM. It mainly consists of two types of iterations. The first type of iterations constructs initial contigs from a reference, say a genome of a species that could be quite distant, by adaptive read mapping, filtration by the reference's unique regions, and reference updating. A statistical test is proposed to split the layouts at possible structural variation sites. The second type of iterations includes mapping, scaffolding/contig-extension, and contig merging. We extend each contig by locally assembling the reads whose mates are uniquely mapped to an end of the contig. Instead of the de Bruijn graph method, we take the overlap-layout-consensus (OLC) paradigm. The OLC is implemented by parallel computation, and has linear complexity with respect to the number of contigs. The adjacent extended contigs are merged if their alignment is confirmed by the adjusted gap distance. Throughout the assembling, the mapping criterion is selected by probabilistic calculations. These innovations can be used complementary to the existing de novo assemblers. Applying this novel method to the assembly of wild rice Oryza longistaminata genome, we achieved much improved contig N50, 18.8k, compared with other assemblers. The assembly was further validated by contigs constructed from an independent library of long 454 reads.
Genomics
What problem does this paper attempt to address?
The paper addresses the difficulty of assembling a genome from high-throughput sequencing (HTS) reads, in particular the uncertainty introduced by widespread repetitive elements. It proposes a new assembler, BAUM, which works around this uncertainty by combining adaptive unique mapping with local overlap-layout-consensus (OLC) assembly.

### Main problems to be solved:

1. **Uncertainty Caused by Repetitive Elements**: Repetitive sequences make it difficult to place short reads at their true positions of origin, which is a major obstacle in genome assembly.
2. **Selection of the Initial Reference Genome**: The only available reference may come from a species quite distant from the target. BAUM is designed to start from such a reference and improve the assembly by iteratively updating it.
3. **High Efficiency and Linear Complexity**: Assembling large genomes has to scale. BAUM's local OLC is reported to have linear complexity with respect to the number of contigs.

### Key Points of the Solution:

1. **Adaptive Unique Mapping**:
   - Use uniquely mapped (UM) reads to build the initial contigs.
   - Split layouts at possible structural variation sites using a statistical test, to reduce incorrect merges.
   - Select the mapping criterion adaptively, by probabilistic calculations, to balance mapping rate and accuracy.
2. **Local Overlap-Layout-Consensus (Local OLC)**:
   - Group reads for local assembly using paired-end or mate-pair information: a contig is extended with the reads whose mates are uniquely mapped to one of its ends.
   - Run the local OLC in parallel, with linear complexity in the number of contigs.
   - Merge adjacent extended contigs only if their alignment is confirmed by the statistically adjusted gap distance.

### Experimental Verification:

- **Simulation Evaluation**: The reliability of the initial contigs is evaluated on simulated data. The results show that assembly errors are effectively controlled at the structural level.
- **Practical Application**: BAUM is applied to the genome assembly of the African wild rice Oryza longistaminata. The final contig N50 reaches 18.8 kbp, whereas the contig N50 of other assemblers is usually below 1 kbp. The assembly is further validated against contigs built from an independent library of long 454 reads.

### Summary:

BAUM mitigates the problem of repetitive elements in genome assembly through adaptive unique mapping and local OLC, improving both assembly quality and efficiency. The method can start from the reference genome of a distant species, and its linear complexity makes it suitable for large-scale genome data.
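
For readers unfamiliar with the contig N50 statistic quoted in the results, the following is a minimal sketch of its usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the toy lengths are illustrative assumptions, not code or data from BAUM.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Standard definition: sort the contigs from longest to shortest and
    return the length of the contig at which the cumulative length
    first reaches at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up contig lengths (not data from the paper).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))  # total is 54; 10 + 9 + 8 = 27 reaches half, so N50 is 8
```

Because N50 is weighted by length, a handful of long contigs raises it far more than many short ones, which is why it is commonly used to compare the contiguity of assemblies of the same genome.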