Abstract: A key step in sequence similarity search is to identify seeds that occur in both the query and the reference sequence. A seed is a shorter substring (e.g., a k-mer) or pattern (e.g., a spaced k-mer) constructed from the sequences. A well-known trade-off in applications such as read mapping is that longer seeds yield faster searches because they produce fewer spurious matches, but they lower sensitivity in variable regions because longer seeds are more likely to harbor mutations. Recent seed constructs include approximate (or fuzzy) seeds, such as k-min-mers, strobemers, BLEND, SubSeqHash, and TensorSketch, among others, that can match across small mutations and thus suffer less from sensitivity loss in variable regions. Nevertheless, the sensitivity-to-speed trade-off still exists for such constructs. In other applications, such as genome assembly, using multiple k-mer sizes is effective. While this can be achieved in read mapping through, e.g., MEM construction from an FM-index, such seed constructs are typically much slower than hash-based constructs. To this end, we introduce multi-context seeds (MCS). In brief, MCS are strobemers in which the hashes of the individual strobes occupy separate partitions of the hash value representing the seed. Such partitioning enables a cache-friendly search for both full matches and partial matches of a subset of strobes. For example, both the full strobemer and its first strobe (a k-mer) can be queried. We show that MCS improve sequence matching statistics over standard strobemers and k-mers without compromising seed uniqueness. We demonstrate the practical applicability of MCS by implementing them in strobealign. On simulated Illumina reads across genomes of various complexity, strobealign with MCS incurs no additional memory cost and only a small runtime cost while offering higher mapping accuracy than default strobealign. We also show that strobealign with MCS outperforms minimap2 in short-read mapping and is comparable to BWA-MEM in accuracy on high-variability sequences. MCS provide a fast seed alternative that addresses the trade-off between seed length and alignment accuracy.
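
The partitioning idea described in the abstract can be illustrated with a minimal Python sketch. This is not strobealign's implementation: the 64-bit layout, the 24-bit split (`AUX_BITS`), the use of BLAKE2b as the strobe hash, and all function names are assumptions chosen for illustration. The sketch stores seed hashes in a sorted array, so all seeds sharing the same first strobe sit next to each other and a partial (first-strobe) lookup is a range query over neighboring entries.

```python
import bisect
import hashlib

TOTAL_BITS = 64
AUX_BITS = 24  # assumption: number of low bits reserved for the second strobe
AUX_MASK = (1 << AUX_BITS) - 1
MAIN_MASK = ((1 << TOTAL_BITS) - 1) ^ AUX_MASK


def strobe_hash(strobe: str) -> int:
    """64-bit hash of a single strobe (BLAKE2b is a stand-in hash function)."""
    digest = hashlib.blake2b(strobe.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def mcs_hash(strobe1: str, strobe2: str) -> int:
    """Partitioned seed hash: high bits from strobe1, low AUX_BITS from strobe2."""
    return (strobe_hash(strobe1) & MAIN_MASK) | (strobe_hash(strobe2) & AUX_MASK)


def build_index(seeds):
    """Build a sorted index from (strobe1, strobe2, position) tuples.

    Returns two parallel lists: sorted seed hashes and their positions.
    """
    entries = sorted((mcs_hash(s1, s2), pos) for s1, s2, pos in seeds)
    hashes = [h for h, _ in entries]
    positions = [p for _, p in entries]
    return hashes, positions


def lookup(index, strobe1: str, strobe2: str):
    """Try a full match first, then fall back to a first-strobe-only match."""
    hashes, positions = index
    h = mcs_hash(strobe1, strobe2)

    # Full match: every bit of the seed hash agrees.
    lo = bisect.bisect_left(hashes, h)
    hi = bisect.bisect_right(hashes, h)
    if lo < hi:
        return "full", positions[lo:hi]

    # Partial match: only the high (first-strobe) partition agrees.
    # These entries are contiguous because the partition holds the top bits.
    prefix = h & MAIN_MASK
    lo = bisect.bisect_left(hashes, prefix)
    hi = bisect.bisect_right(hashes, prefix | AUX_MASK)
    if lo < hi:
        return "partial", positions[lo:hi]

    return "none", []
```

A query whose second strobe is disrupted by a mutation would miss the full lookup but can still be recovered through the partial lookup, which behaves like a k-mer query on the first strobe. Because both lookups touch the same region of the sorted array, the fallback needs no second data structure.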

MISSH: Fast Hashing of Multiple Spaced Seeds

Efficient Seeding for Error-Prone Sequences with SubseqHash2

Spaced seeds improve k-mer-based metagenomic classification

Seeding with Minimized Subsequence

Multi-context seeds enable fast and high-accuracy read mapping

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method

Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping

RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes

Extraction of long k-mers using spaced seeds

Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching

conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads

Space-efficient computation of k-mer dictionaries for large values of k

RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

Perm: Efficient Mapping of Short Sequencing Reads with Periodic Full Sensitive Spaced Seeds

Fast Scalable Supervised Hashing

RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization

Fast and accurate short read alignment with hybrid hash-tree data structure

Fast and Accurate Hashing Via Iterative Nearest Neighbors Expansion

DNA Hash Pooling and its Applications

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics