K-means clustering based compression algorithm for the high-throughput DNA sequence

Li Tan, Jifeng Sun
DOI: https://doi.org/10.1109/ICALIP.2014.7009935
2014-01-01
Abstract: This paper proposes a compression algorithm based on K-means clustering for high-throughput DNA sequences (DNAC-K). DNAC-K first groups the sequences into clusters using the K-means clustering method, then iterates the clusters according to the edit distances of subsequences, and finally applies Huffman coding to encode the clustering result. Experimental results on several sequencing data sets show that DNAC-K performs better than many current high-throughput DNA sequence compression algorithms.
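The abstract names two standard building blocks, edit distance and Huffman coding, without giving implementation details. The following is a minimal Python sketch of those two generic techniques only; it is not the authors' DNAC-K implementation, and the function names `edit_distance` and `huffman_code` are hypothetical, chosen here for illustration. How DNAC-K represents sequences for K-means and how it combines the stages is not stated in the abstract and is not modeled here.

```python
import heapq
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def huffman_code(symbols) -> dict:
    """Build a prefix-free code table {symbol: bitstring} from frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table); the unique
    # tie_breaker keeps the dictionaries from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]
```

In this sketch, `edit_distance("ACGT", "AGT")` counts the single deletion separating two reads, and `huffman_code("AAAACCGT")` assigns the shortest code to the most frequent base. A real pipeline would apply the coder to whatever symbols the clustering stage emits, which the abstract does not specify.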