Abstract:Motivation: Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,O) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order O from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order O is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. Methods: In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position p mod w. The lr-minimizer uses t = k − w and the mod-minimizer uses t ≡ k (mod w). Results: These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w, k) = (11, 21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.

Seeding with Minimized Subsequence.

Efficient Seeding for Error-Prone Sequences with SubseqHash2

Efficient Algorithms for Finding a Longest Common Increasing Subsequence

MISSH: Fast Hashing of Multiple Spaced Seeds

Multi-context seeds enable fast and high-accuracy read mapping

A compressive seeding algorithm in conjunction with reordering-based compression

A simple refined DNA minimizer operator enables 2-fold faster computation

Fundamental Limits of Multiple Sequence Reconstruction from Substrings

Maximum Match Subsequence Alignment Algorithm Finely Grained (MMSAA FG)

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

The mod-minimizer: a simple and efficient sampling algorithm for long k-mers

An Optimal Algorithm to Find Seeds in Strings

Algorithms for the Uniqueness of the Longest Common Subsequence

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

DNA Hash Pooling and its Applications

Perm: Efficient Mapping of Short Sequencing Reads with Periodic Full Sensitive Spaced Seeds

conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads

Efficient algorithms for the longest common subsequence in $k$-length substrings

A simple method for finding related sequences by adding probabilities of alternative alignments

FANSe2: a robust and cost-efficient alignment tool for quantitative next-generation sequencing applications.