Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi
DOI: https://doi.org/10.1093/bioinformatics/btz144
IF: 5.8
2019-02-25
Bioinformatics
Abstract: Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. The NAF compression ratio is comparable to that of the best DNA compressors, while decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd. Availability and implementation: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use. Supplementary information: Supplementary data are available at Bioinformatics online.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology