Scalable De Novo Genome Assembly Using Pregel

Da Yan, Hongzhi Chen, James Cheng, Zhenkun Cai, Bin Shao
DOI: https://doi.org/10.48550/arXiv.1801.04453
2018-01-13
Abstract: De novo genome assembly is the process of stitching short DNA sequences to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present a toolkit, called PPA-assembler, for de novo genome assembly in a distributed setting. The operations in our toolkit provide strong performance guarantees, and can be assembled to implement various sequencing strategies. PPA-assembler adopts the popular de Bruijn graph-based approach for sequencing, and each operation is implemented as a program in Google's Pregel framework for big graph processing. Experiments on large real and simulated datasets demonstrate that PPA-assembler is much more efficient than the state of the art and provides good sequencing quality.
Distributed, Parallel, and Cluster Computing