Abstract: String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data.
All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in O(n log σ) bits of space in addition to the input, where σ is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log σ + o(n log σ) bits of space, and that can be learnt in randomized O(n) time using O(n log σ) bits of space in addition to the input.
Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m + o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
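
To make concrete what the k-mer kernel measures, the following is a minimal reference sketch in Python. It is not the space-efficient, BWT-based method of the abstract: it uses plain hash tables, so its space grows with the number of distinct k-mers. The function names (`kmer_counts`, `kmer_kernel`, `normalized_kmer_kernel`) are hypothetical, and the cosine normalization is a common convention assumed here for illustration rather than a detail taken from the text above.

```python
from collections import Counter
from math import sqrt


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring of s (naive, hash-based)."""
    # If len(s) < k the range is empty and the Counter stays empty.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> int:
    """Unnormalized k-mer kernel: dot product of the two k-mer count vectors."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Iterate over the smaller table; a Counter returns 0 for missing keys.
    if len(cs) > len(ct):
        cs, ct = ct, cs
    return sum(count * ct[kmer] for kmer, count in cs.items())


def normalized_kmer_kernel(s: str, t: str, k: int) -> float:
    """Cosine-normalized k-mer kernel (assumed normalization), in [0, 1]."""
    denom = sqrt(kmer_kernel(s, s, k) * kmer_kernel(t, t, k))
    return kmer_kernel(s, t, k) / denom if denom else 0.0
```

For example, `kmer_kernel("ACGT", "ACGA", 2)` counts the shared 2-mers `AC` and `CG` once each. A baseline like this is mainly useful for checking a space-efficient implementation on small inputs.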

Algorithms for all‐pairs Hamming distance based similarity

An efficient method for time series similarity search using binary code representation and hamming distance

A Faster Algorithm for Finding Closest Pairs in Hamming Metric

Constrained Pairwise and Center-Star Sequences Alignment Problems

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method.

Efficient Parallel Partition-Based Algorithms for Similarity Search and Join with Edit Distance Constraints

Efficient Approximate Algorithms for the Closest Pair Problem in High Dimensional Spaces.

Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

A Similarity Computing Algorithm for Proteins

Faster Algorithms for Text-to-Pattern Hamming Distances

A New Algorithm for Finding Closest Pair of Vectors

A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

CuAPSS: A Hybrid CUDA Solution for AllPairs Similarity Search.

A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints

Investigating the complexity of the double distance problems

Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping

A Framework for Space-Efficient String Kernels

Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity

Hamming Distance Oracle

Top-k String Similarity Search with Edit-Distance Constraints