TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes

Ilia Minkin,Son Pham,Paul Medvedev
DOI: https://doi.org/10.1093/bioinformatics/btw609
IF: 5.8
2016-09-21
Bioinformatics
Abstract:MOTIVATION: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both a population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes).RESULTS: In this article, we present TwoPaCo, a simple and scalable low memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes.AVAILABILITY AND IMPLEMENTATION: Our code and data is available for download from github.com/medvedevgroup/TwoPaCo.CONTACT: ium125@psu.edu.SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
biochemical research methods,biotechnology & applied microbiology,mathematical & computational biology
What problem does this paper attempt to address?