GBZ file format for pangenome graphs

Jouni Sirén, Benedict Paten
DOI: https://doi.org/10.1093/bioinformatics/btac656
IF: 5.8
2022-11-15
Bioinformatics
Abstract:
Motivation: Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space efficiently.
Results: We propose the GBZ file format based on data structures used in the Giraffe short-read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems.
Availability and implementation: C++ and Rust implementations are available at https://github.com/jltsiren/gbwtgraph and https://github.com/jltsiren/gbwt-rs, respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.