Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for 𝑘-Mer Sets

Ondřej Sladký, Pavel Veselý, Karel Břinda
DOI: https://doi.org/10.1101/2024.03.06.583483
2024-06-17
Abstract: The exponential growth of DNA sequencing data calls for novel space-efficient algorithms for their compression and search. State-of-the-art approaches often use 𝑘-merization for data tokenization, yet efficiently representing and querying 𝑘-mer sets remains a significant bioinformatics challenge. Our recent work introduced the concept of masked superstrings, which compactly represent 𝑘-mer sets without reliance on common structural assumptions. However, the applicability of masked superstrings for set operations and membership queries remained open. Here, we develop the 𝑓-masked superstring framework, which integrates demasking functions 𝑓, enabling efficient 𝑘-mer set operations through concatenation. Combined with a tailored version of the FM-index, this framework provides a versatile, compact data structure for 𝑘-mer sets. We demonstrate its effectiveness with the FMSI program, which, when evaluated on bacterial pan-genomes, improves space efficiency by a factor of 1.4 to 4.5 compared to leading single 𝑘-mer-set indexing methods such as SSHash and SBWT. Overall, our work highlights the potential of 𝑓-masked superstrings as a versatile elementary data type for 𝑘-mer sets.
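
The idea of set operations through concatenation can be illustrated with a small sketch. It is a toy model written for this summary, not the FMSI implementation. It assumes that a masked superstring is a pair (superstring, mask) with one mask bit per position, that a 𝑘-mer is represented when the demasking function 𝑓 returns true on the mask bits at all of its occurrences, and that the last 𝑘−1 mask bits of each operand are 0 so that 𝑘-mers spanning the junction contribute only zeros. Reverse complements are ignored, and all function names (`demask`, `concat`, `f_or`, `f_xor`) are hypothetical.

```python
from collections import defaultdict

def occurrence_bits(superstring, mask, k):
    """Map each k-mer to the list of mask bits at its start positions."""
    bits = defaultdict(list)
    for i in range(len(superstring) - k + 1):
        bits[superstring[i:i + k]].append(mask[i])
    return bits

def demask(masked, k, f):
    """Return the set of k-mers for which f accepts the occurrence bits."""
    superstring, mask = masked
    return {kmer for kmer, bits in occurrence_bits(superstring, mask, k).items()
            if f(bits)}

def f_or(bits):
    # Assumed default demasking function: at least one occurrence is masked on.
    return any(bits)

def f_xor(bits):
    # Assumed alternative: an odd number of occurrences is masked on.
    return sum(bits) % 2 == 1

def concat(a, b):
    """Concatenate two masked superstrings (strings and masks alike)."""
    return a[0] + b[0], a[1] + b[1]

k = 3
# Each mask ends with k-1 zeros, since no k-mer starts at those positions.
A = ("ACGT", [1, 1, 0, 0])   # represents {ACG, CGT} under f_or
B = ("CGTA", [1, 1, 0, 0])   # represents {CGT, GTA} under f_or
AB = concat(A, B)

union = demask(AB, k, f_or)       # {ACG, CGT, GTA}
sym_diff = demask(AB, k, f_xor)   # {ACG, GTA}
```

In this toy setting, the same concatenated string yields the union under `f_or` and the symmetric difference under `f_xor`; only the demasking function changes, which is the sense in which 𝑓 is "assigned" to the representation.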
Bioinformatics