Abstract: Biological data mainly comprises deoxyribonucleic acid (DNA) and protein sequences. These biomolecules are present in all human cells. Because DNA is self-replicating, it is a key constituent of the genetic material found in all living creatures. DNA contains the genetic information required for the functioning and development of all living organisms. Storing the DNA data of a single person requires 10 CD-ROMs. Moreover, this volume is constantly increasing as more and more sequences are added to public databases. This rapid growth in sequence data makes it challenging to extract precise information, since many data analysis and visualization tools do not support processing such large amounts of data. To reduce the size of DNA and protein sequences, researchers have applied various compression algorithms, such as compress, gzip, Context Tree Weighting (CTW), Lempel-Ziv-Welch (LZW), arithmetic coding, run-length encoding, and the substitution method. These techniques have contributed substantially to reducing the volume of biological datasets. However, traditional general-purpose compression techniques are not well suited to this kind of sequence data. In this paper, we explore diverse techniques for compressing large amounts of DNA sequence data. Our analysis reveals that efficient techniques not only reduce the size of the sequence but also avoid any information loss. The review of existing studies also shows that compressing a DNA sequence is significant for understanding the critical characteristics of DNA data, in addition to improving storage efficiency and data transmission. The compression of protein sequences remains a challenge for the research community.
The main parameters for evaluating these compression algorithms include compression ratio and running-time complexity.
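As a rough illustration of the substitution method and of the compression-ratio metric mentioned above, the following Python sketch packs each nucleotide into 2 bits and checks that decoding restores the original sequence, i.e. that no information is lost. It is a minimal sketch under stated assumptions, not any surveyed paper's algorithm: the function names (`pack`, `unpack`, `compression_ratio`) are hypothetical, the input is assumed to contain only A, C, G and T, and the ratio is assumed to be defined as original size divided by compressed size (some studies report bits per base instead).

```python
# Illustrative sketch only: 2-bit substitution coding for DNA.
# Assumptions: input contains only A, C, G, T; helper names are hypothetical.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        # Left-align a short final chunk so every byte decodes the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Decode packed bytes back into the first n bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 3])
    # The sequence length n must be stored separately to drop padding.
    return "".join(bases[:n])


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Assumed definition: original size divided by compressed size."""
    return original_size / compressed_size


seq = "ACGTACGTTTGACA"          # 14 bases, 14 bytes as plain ASCII text
packed = pack(seq)              # 4 bytes after 2-bit packing
assert unpack(packed, len(seq)) == seq   # lossless round trip
ratio = compression_ratio(len(seq), len(packed))
```

With one byte per base as the baseline, 2-bit packing cannot exceed a ratio of 4; the specialized algorithms discussed in the literature aim to go further by exploiting repeats and statistical structure in the sequence.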

K-means clustering based compression algorithm for the high-throughput DNA sequence

A Novel Compression Algorithm for High-Throughput DNA Sequence Based on Huffman Coding Method

Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Modified HuffBit Compress Algorithm – An Application of R

CD-HIT: accelerated for clustering the next-generation sequencing data

Data Clustering Algorithm for DNA Microarray Based on Graph Theory

A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering

Optimization of GN Algorithm Based on DNA Computation

A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method

DDQR (dynamic DNA QR coding): An efficient algorithm to represent DNA barcode sequences

Analysis of Compression Techniques for DNA Sequence Data

ACO: lossless quality score compression based on adaptive coding order

DUHI: Dynamically updated hash index clustering method for DNA storage

A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance

Genetic Sequence compression using Machine Learning and Arithmetic Encoding Decoding Techniques

PLDSRC: A Multi-threaded Compressor/Decompressor for Massive DNA Sequencing Data

Clover: tree structure-based efficient DNA clustering for DNA-based data storage

DNA Sequence Classification with Compressors

An Efficient Biological Sequence Compression Technique Using LUT And Repeat In The Sequence

Bioinformatics Methods for High-Throughput DNA Sequencing Data