Abstract:

BACKGROUND: With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none is suitable for compressing RNA sequences together with their secondary structures. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data. That in turn raises the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA, on the basis of compression.

RESULTS: RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are twofold: (1) to present a robust and effective way to compress RNA structural data; (2) to design a suitable model to represent RNA secondary structure and to derive the informational complexity of the structural data from its compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio than other sequence-specific or general text compression algorithms, such as GenCompress, WinRAR and gzip. Moreover, a test comparing the activities of distinct GTP-binding RNAs (aptamers) with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective.

CONCLUSION: This paper discusses a universal algorithm for compressing RNA secondary structure and for evaluating its informational complexity. We have developed RNACompress as a tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measure of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules.
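The idea of reading informational complexity off a compressed size can be illustrated with a minimal sketch. It is not the RNACompress grammar model: it substitutes the general-purpose zlib compressor as a stand-in, assumes the secondary structure is given in dot-bracket notation (an assumption, not stated in the abstract), and uses a hypothetical function name, `bits_per_base`.

```python
import zlib


def bits_per_base(sequence: str, structure: str) -> float:
    """Compressed size, in bits per nucleotide, of a sequence plus its structure.

    Assumptions (not from the paper): the structure is a dot-bracket string
    of the same length as the sequence, and zlib stands in for a
    grammar-based compressor.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")

    # Compress sequence and structure together, so redundancy in either
    # one lowers the estimate.
    payload = (sequence + "\n" + structure).encode("ascii")
    compressed = zlib.compress(payload, 9)  # 9 = strongest zlib level
    return 8 * len(compressed) / len(sequence)


if __name__ == "__main__":
    # A highly regular hairpin-like input: expected to compress well.
    regular_seq = "GC" * 200
    regular_struct = "(" * 200 + ")" * 200
    print(f"{bits_per_base(regular_seq, regular_struct):.3f} bits/base")
```

A lower value indicates a more redundant, less complex input under this stand-in compressor. Absolute values depend on the compressor chosen, so only comparisons made with the same compressor are meaningful.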

Reference Sequence Construction for Relative Compression of Genomes

Engineering Relative Compression of Genomes

Genome Compression Against a Reference

A Compressed Self-Index for Genomic Databases

DNA Lossless Differential Compression Algorithm based on Similarity of Genomic Sequence Database

Genomic Compression with Read Alignment at the Decoder

A Novel Compression Algorithm for High-Throughput DNA Sequence Based on Huffman Coding Method

A new efficient referential genome compression technique for FastQ files

A Pipeline for Constructing Reference Genomes for Large Cohort-Specific Metagenome Compression

Pareto Optimal Compression of Genomic Dictionaries, with or without Random Access in Main Memory

Reference-based genome compression using the longest matched substrings with parallelization consideration

An Efficient Biological Sequence Compression Technique Using LUT And Repeat In The Sequence

Analysis of Compression Techniques for DNA Sequence Data

Genbit Compress Tool(GBC): A Java-Based Tool to Compress DNA Sequences and Compute Compression Ratio(bits/base) of Genomes

A compressive seeding algorithm in conjunction with reordering-based compression

CoGI: Towards Compressing Genomes As an Image

Modified HuffBit Compress Algorithm – An Application of R

Compression of high throughput sequencing data with probabilistic de Bruijn graph

A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression

PLDSRC: A Multi-threaded Compressor/Decompressor for Massive DNA Sequencing Data

RNACompress: Grammar-based Compression and Informational Complexity Measurement of RNA Secondary Structure