FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model

Dohyeon Lee, Giltae Song
DOI: https://doi.org/10.1093/bioinformatics/btab696
IF: 5.8
2022-01-03
Bioinformatics
Abstract:
Motivation: Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only.
Results: We designed a compression algorithm based on read reordering using a novel scoring model for reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provided it as a Docker image for ease of installation and execution. We compared our method with existing major FASTQ compression tools using benchmark datasets. We also included new long-read sequencing data in this validation. As a result, FastqCLS outperformed the compared tools in terms of compression ratios for storing long-read sequencing data.
Availability and implementation: FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS.
Supplementary information: Supplementary data are available at Bioinformatics online.