Abstract: Advanced sequencing machines dramatically speed up the generation of genomic data, which makes efficient compression of sequencing data an urgent need. Quality scores are the most difficult part of the standard sequencing data format FASTQ to compress, and they have become the main obstacle in the development of FASTQ compressors. Existing lossless compressors of quality scores mainly rely on patterns specific to particular sequencers and on complex context modeling techniques to raise the compression ratio. However, these compressors suffer from weak robustness, meaning unstable or even unavailable results for some sequencing files, and from slow compression speed. Meanwhile, some compressors construct a fine-grained index structure to speed up random access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. An efficient lossless compressor of quality scores with strong robustness, a high compression ratio, fast compression, and fast random access decompression is therefore needed. In this paper, based on the idea of maximizing the use of hardware resources, we propose LCQS, a lossless compression tool specialized for quality scores. It consists of four sequential processing steps: partitioning, indexing, packing, and parallelizing. Experimental results show that LCQS outperforms all the other state-of-the-art compressors on all criteria except for compression speed on the dataset SRR1284073. Furthermore, LCQS is robust on all the test datasets, with compression speed improved by up to 29.1x, file size reduced by up to 28.78%, and random access decompression speed improved by up to 2.1x.
LCQS also exhibits strong scalability: the compression speed increases almost linearly as the size of the input dataset increases. Its ability to handle all different kinds of quality scores, its superior compression ratio and compression speed, and its fast random access decompression make LCQS a highly efficient and advanced lossless quality score compressor. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial use.
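The abstract names the four steps (partitioning, indexing, packing, parallelizing) without detailing them. The following minimal Python sketch illustrates the general idea of block-wise compression with an offset index for random access. It is not the LCQS algorithm: `zlib` stands in for whatever coder LCQS actually uses, and the function names, the fixed `block_size`, and the thread pool are assumptions made purely for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(lines, block_size=4, workers=2):
    """Illustrative only: partition quality-score lines into fixed-size
    blocks, compress each block independently, and record an offset index."""
    # Partitioning: split the lines into blocks of `block_size` lines each.
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]
    payloads = ["\n".join(b).encode("ascii") for b in blocks]

    # Parallelizing: blocks are independent, so they can be compressed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(zlib.compress, payloads))

    # Indexing: remember (offset, length) of every compressed block.
    index, offset = [], 0
    for chunk in compressed:
        index.append((offset, len(chunk)))
        offset += len(chunk)

    # Packing: concatenate all compressed blocks into one byte string.
    return b"".join(compressed), index


def read_line(packed, index, line_no, block_size=4):
    """Random access: decompress only the block that holds `line_no`."""
    block_id, within = divmod(line_no, block_size)
    start, length = index[block_id]
    block = zlib.decompress(packed[start:start + length]).decode("ascii")
    return block.split("\n")[within]
```

With this layout, fetching one quality-score line costs one index lookup and the decompression of a single block, which is the trade-off between index size and random access speed that the abstract refers to.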

Machete: an Efficient Lossy Floating-Point Compressor Designed for Time Series Databases

Deep Dict: Deep Learning-based Lossy Time Series Compressor for IoT Data

MOST: Model-Based Compression with Outlier Storage for Time Series Data

Erasing-based lossless compression method for streaming floating-point time series

Accelerating Lossy Compression on HPC Datasets Via Partitioning Computation for Parallel Processing

Time Series Data Encoding for Efficient Storage

Time-universal data compression and prediction

A Versatile Compression Method for Floating-Point Data Stream

Czip: A Fast Lossless Compression Algorithm for Climate Data

LCQS: an Efficient Lossless Compression Tool of Quality Scores with Random Access Functionality

Adaptive Encoding Strategies for Erasing-Based Lossless Floating-Point Compression

TVStore: Automatically Bounding Time Series Storage via Time-Varying Compression

Performance Optimization for Relative-Error-Bounded Lossy Compression on Scientific Data

Accelerating Relative-error Bounded Lossy Compression for HPC Datasets with Precomputation-Based Mechanisms

Change a Bit to save Bytes: Compression for Floating Point Time-Series Data

A Fast Compression Algorithm for Seismic Data from Non-Cable Seismographs

Spatiotemporally adaptive compression for scientific dataset with feature preservation -- a case study on simulation data with extreme climate events analysis

Real-Time Lossless Compression for Ultrahigh-Density Synchrophasor and Point-on-Wave Data

ADT-FSE: A New Encoder for SZ

Massively-Parallel Lossless Data Decompression

CompressDB: Enabling Efficient Compressed Data Direct Processing for Various Databases