Engineering Relative Compression of Genomes

Szymon Grabowski, Sebastian Deorowicz
DOI: https://doi.org/10.48550/arXiv.1103.2351
2011-03-11
Computational Engineering, Finance, and Science
Abstract: Technological progress in DNA sequencing is driving the growth of genomic databases at an ever faster rate. Compression, accompanied by random access capabilities, is the key to maintaining those huge amounts of data. In this paper we present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at a compression speed more than an order of magnitude greater. One of the successful new ideas is augmenting the reference sequence with phrases from the other sequences, making more LZ-matches available.
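The two ideas named in the abstract, LZ77-style matching against a reference and augmenting that reference with phrases from other sequences, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the k-mer hash index, the greedy longest-match rule, the token format, and the `min_run` threshold are all assumptions chosen for brevity.

```python
def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def relative_compress(target, reference, k=4):
    """Greedily factor target into ('match', pos, length) and ('literal', char) tokens.

    Hypothetical token format; matches shorter than k are never emitted.
    """
    index = build_index(reference, k)
    tokens = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer; keep the longest match.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            tokens.append(("match", best_pos, best_len))
            i += best_len
        else:
            tokens.append(("literal", target[i]))
            i += 1
    return tokens


def decompress(tokens, reference):
    """Rebuild the target from its tokens and the same reference."""
    out = []
    for token in tokens:
        if token[0] == "match":
            _, pos, length = token
            out.append(reference[pos:pos + length])
        else:
            out.append(token[1])
    return "".join(out)


def augment_reference(reference, tokens, min_run=4):
    """Append literal runs of at least min_run symbols to the reference.

    Assumed augmentation rule, for illustration only: material that could not
    be matched in one sequence becomes matchable for the sequences that follow.
    """
    extra, run = [], []
    for token in tokens + [("match", 0, 0)]:  # sentinel flushes the final run
        if token[0] == "literal":
            run.append(token[1])
        else:
            if len(run) >= min_run:
                extra.append("".join(run))
            run = []
    return reference + "".join(extra)
```

In this sketch, a decoder must rebuild the augmented reference in the same order as the encoder, so sequences have to be decoded in the order in which they were compressed. A practical scheme would also encode the tokens compactly, which is omitted here.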