The Backpack Quotient Filter: a dynamic and space-efficient data structure for querying k-mers with abundance

Victor Levallois, Francesco Andreace, Bertrand Le Gal, Yoann Dufresne, Pierre Peterlongo
DOI: https://doi.org/10.1101/2024.02.15.580441
2024-02-18
Abstract: Genomic data sequencing has become indispensable for elucidating the complexities of biological systems. As databases storing genomic information, such as the European Nucleotide Archive, continue to grow exponentially, efficient solutions for data manipulation are imperative. One fundamental operation that remains challenging is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets. This paper introduces a novel data structure for indexing k-mers (substrings of length k), the Backpack Quotient Filter (BQF), which serves as an alternative to the Counting Quotient Filter (CQF). The BQF offers enhanced space efficiency compared to the CQF while retaining key properties, including abundance information and dynamicity, with a negligible false positive rate, below 10^-5 %. The approach involves a redefinition of how abundance information is handled within the structure, along with an independent strategy for space efficiency. We show that the BQF uses 4x less space than the CQF on some of the most complex data to index: sea-water metagenomics sequences. Furthermore, we show that space efficiency increases as the amount of data to be indexed increases, which is in line with the original objective of scaling to ever-larger datasets.
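To make the query concrete, here is a minimal Python sketch of what "querying k-mers with abundance" means. It is an exact, dictionary-based baseline and not the BQF or the CQF: it has no false positives and none of their space savings. The function names `count_kmers` and `query_abundance` and the toy reads are hypothetical, chosen only for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every substring of length k (k-mer) across the given sequences."""
    counts = Counter()
    for seq in sequences:
        # A sequence of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def query_abundance(counts, kmer):
    """Return how many times kmer was seen; 0 means it is absent."""
    return counts.get(kmer, 0)


# Hypothetical toy dataset of two reads, indexed with k = 4.
reads = ["ACGTACGT", "CGTA"]
index = count_kmers(reads, k=4)

present = query_abundance(index, "ACGT")  # k-mer occurring in the first read
absent = query_abundance(index, "TTTT")   # k-mer occurring in no read
```

A structure such as the one the abstract describes answers the same kind of query while storing far less than the full k-mer strings, at the cost of a small false positive rate.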
Bioinformatics