Abstract: String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data.
All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in O(n log σ) bits of space in addition to the input, where σ is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log σ + o(n log σ) bits of space, and that can be learnt in randomized O(n) time using O(n log σ) bits of space in addition to the input.
Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m + o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
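
To make the simplest of the kernels named above concrete, the following is a minimal Python sketch of the k-mer kernel, assuming it is taken to be the cosine of the two k-mer count vectors. It is a naive hash-table baseline that only illustrates the quantity being computed; it is not the BWT and rangeDistinct based algorithm described in the abstract, and it does not achieve the stated space bounds. The function names `kmer_counts` and `kmer_kernel` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring of s, overlapping occurrences included."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> float:
    """Cosine of the k-mer count vectors of s and t (assumed normalization)."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Inner product: only k-mers that occur in both strings contribute.
    dot = sum(c * ct[w] for w, c in cs.items() if w in ct)
    norm_s = sqrt(sum(c * c for c in cs.values()))
    norm_t = sqrt(sum(c * c for c in ct.values()))
    if norm_s == 0 or norm_t == 0:
        # A string shorter than k has no k-mers; define the kernel as 0 here.
        return 0.0
    return dot / (norm_s * norm_t)
```

Calling `kmer_kernel(s, t, k)` on two strings gives a value between 0 and 1. The hash table stores every distinct k-mer explicitly, which is the kind of space cost the abstract's approach is designed to avoid.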
