A Compression Algorithm of Fastq File Based on Distribution Characteristics Analysis

Shengyu Lu, Hanping Chen, Lifa Peng, Beizhan Wang, Hongji Wang, Xiuze Zhou
DOI: https://doi.org/10.1109/iccse.2018.8468742
2018-01-01
Abstract: With the continuous development of sequencing technology, the cost of DNA sequencing has gradually fallen, and the volume of DNA sequencing data has increased substantially as a result. This genome data must be stored, but traditional computer rooms do not have enough capacity for data at this scale, so more and more genome data needs to be uploaded to the cloud. Because genomic data has been growing much faster than communication speeds, compressing genome data is particularly important: it reduces costs for scientific research institutions and speeds up the sharing of genomic data. The fastq file is an important format for genomic data, and current compression algorithms for fastq files mainly include DSRC, FQC, etc. These algorithms also compress based on the characteristics of fastq files. To improve the compression ratio, we propose an algorithm, DDSRC, that establishes statistical models of the distribution characteristics of strings in fastq files in order to compress them more efficiently. This paper explains the algorithm based on distribution characteristics analysis and compares its results with those of other compression algorithms.
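To make "distribution characteristics of strings in fastq files" concrete, here is a minimal Python sketch. It is not the DDSRC algorithm, whose details the abstract does not give. It only assumes the common four-line fastq record layout (identifier, bases, `+` separator, quality string) and measures the empirical symbol distribution and entropy of each field. The function names and the toy `SAMPLE` data are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

# Toy data (hypothetical): two records in the usual four-line fastq layout.
SAMPLE = """@read1
ACGTACGT
+
IIIIHHHH
@read2
ACGTTTGA
+
IIIIIIHH
"""

def parse_fastq(text):
    """Yield (identifier, bases, quality) tuples from four-line records."""
    lines = [ln for ln in text.splitlines() if ln]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 3]

def symbol_distribution(strings):
    """Return the empirical probability of each character over all strings."""
    counts = Counter()
    for s in strings:
        counts.update(s)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sym: n / total for sym, n in counts.items()}

def entropy_bits(dist):
    """Shannon entropy in bits per symbol of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

records = list(parse_fastq(SAMPLE))
base_dist = symbol_distribution(bases for _, bases, _ in records)
qual_dist = symbol_distribution(qual for _, _, qual in records)

print("bases  :", base_dist, "entropy:", entropy_bits(base_dist))
print("quality:", qual_dist, "entropy:", entropy_bits(qual_dist))
```

Entropy gives a lower bound, in bits per symbol, for any coder that treats symbols as independent draws from the measured distribution. That is why a field with a skewed distribution (the quality strings in this toy sample) is a natural target for a field-specific statistical model.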