A Novel Compression Algorithm for High-Throughput DNA Sequence Based on Huffman Coding Method

Chuan He, Huaiqiu Zhu
DOI: https://doi.org/10.1109/cisp-bmei.2018.8633219
2018-01-01
Abstract: Next-generation sequencing (NGS) technology can sequence DNA at large scale in a single run, producing a large number of short DNA reads. Transferring and processing these data is therefore difficult. Compression methods for high-throughput DNA data fall into two kinds: reference-based and reference-free. Reference-free methods can compress DNA data from different species without storing a large reference genome. In this paper, we propose a reference-free algorithm, named HDC, that performs high-throughput DNA compression based on Huffman coding and a dictionary method. The algorithm builds multiple dictionaries through Huffman coding and uses them for compression and decompression. In tests on the human, green monkey, and horse genomes, HDC's lowest compression rate reaches 0.192, obtained when compressing the human genome with the chromosome as the compression unit. We also compared HDC with a conventional compression algorithm, gzip, and two reference-free DNA compression algorithms, Leon and ORCOM. The results demonstrate that HDC performs significantly better than the other three algorithms.
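The abstract does not describe how HDC constructs its dictionaries, so the following is only a minimal sketch of the general idea of a Huffman-coded dictionary for DNA, not the authors' implementation. It assumes, purely for illustration, that a read is split into fixed-length k-mers and that each distinct k-mer receives a Huffman code; the function names (`build_huffman_dictionary`, `split_kmers`, `encode`, `decode`) and the choice of k are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_dictionary(symbols):
    """Return {symbol: bitstring}, giving frequent symbols shorter codes."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(w, next(tiebreak), {s: ""}) for s, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def split_kmers(read, k):
    """Split a read into consecutive, non-overlapping k-mers (assumed scheme)."""
    return [read[i:i + k] for i in range(0, len(read), k)]


def encode(read, dictionary, k):
    """Replace each k-mer with its dictionary code."""
    return "".join(dictionary[kmer] for kmer in split_kmers(read, k))


def decode(bits, dictionary):
    """Invert the dictionary; Huffman codes are prefix-free, so greedy matching works."""
    inverse = {code: sym for sym, code in dictionary.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    read = "ACGTACGTACGTTTGA"  # toy input, not data from the paper
    k = 2
    dictionary = build_huffman_dictionary(split_kmers(read, k))
    bits = encode(read, dictionary, k)
    assert decode(bits, dictionary) == read
    print(len(bits), "bits versus", 2 * len(read), "bits at 2 bits per base")
```

In this toy setting the dictionary must be available to the decoder, which is why a dictionary-based scheme uses the same dictionaries for both compression and decompression.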