Abstract: A key step in sequence similarity search is to identify seeds that are found in both the query and the reference sequence. A seed is a shorter substring (e.g., a k-mer) or pattern (e.g., a spaced k-mer) constructed from the sequences. A well-known trade-off in applications such as read mapping is that longer seeds produce fewer spurious matches and therefore faster searches, but lower sensitivity in variable regions, because longer seeds are more likely to harbor mutations. Some recent seed constructs are approximate (or fuzzy) seeds, such as k-min-mers, strobemers, BLEND, SubseqHash, and TensorSketch, among others, which can match across small mutations and thus suffer less from reduced sensitivity in variable regions. Nevertheless, the sensitivity-to-speed trade-off still exists for such constructs. In other applications, such as genome assembly, using multiple k-mer sizes is effective. While this can be achieved in read mapping through, e.g., MEM construction from an FM-index, such seed constructs are typically much slower than hash-based constructs. To this end, we introduce multi-context seeds (MCS). In brief, MCS are strobemers in which the hashes of the individual strobes occupy separate partitions of the hash value representing the seed. Such partitioning enables a cache-friendly search for both full matches and partial matches of a subset of strobes. For example, both the full strobemer and the first strobe (a k-mer) can be queried. We show that MCS improve sequence matching statistics over standard strobemers and k-mers without compromising seed uniqueness. We demonstrate the practical applicability of MCS by implementing them in strobealign. On simulated Illumina reads across genomes of various complexity, strobealign with MCS comes at no cost in memory and only a small cost in runtime while offering increased mapping accuracy over default strobealign. We also show that strobealign with MCS outperforms minimap2 in short-read mapping and is comparable to BWA-MEM in accuracy in high-variability sequences. MCS provide a fast seed alternative that addresses the trade-offs between seed length and alignment accuracy.
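The partitioned hash value described in the abstract can be illustrated with a minimal Python sketch. It is not the strobealign implementation: the 32/32 bit split, the BLAKE2b strobe hash, the simplified randstrobe-style linking rule, and all names (`strobe_hash`, `mcs_value`, `randstrobes`, `build_index`, `full_lookup`, `partial_lookup`) are assumptions made for illustration. The first strobe's hash fills the high bits of the seed value and the second strobe's hash fills the low bits, so in a sorted index all seeds that share a first strobe are adjacent, and the full matches form a contiguous sub-range of the partial matches.

```python
import bisect
import hashlib

K = 5                  # strobe length (illustrative choice)
W_MIN, W_MAX = 1, 6    # offsets of the window holding the second strobe
AUX_BITS = 32          # low bits reserved for the second strobe (assumed split)


def strobe_hash(s):
    """Deterministic 64-bit hash of a strobe (BLAKE2b is an arbitrary choice)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def mcs_value(s1, s2):
    """Pack both strobe hashes into one 64-bit value: s1 in the high bits, s2 in the low bits."""
    main = strobe_hash(s1) >> AUX_BITS   # top 32 bits of the first strobe's hash
    aux = strobe_hash(s2) >> AUX_BITS    # top 32 bits of the second strobe's hash
    return (main << AUX_BITS) | aux


def randstrobes(seq, k=K, w_min=W_MIN, w_max=W_MAX):
    """Yield (pos, strobe1, strobe2) using a simplified randstrobe-style link."""
    for i in range(len(seq) - k + 1):
        s1 = seq[i:i + k]
        lo = i + k + w_min
        hi = min(i + k + w_max, len(seq) - k)
        if lo > hi:
            break  # no room left for a second strobe
        h1 = strobe_hash(s1)
        j = min(range(lo, hi + 1), key=lambda p: h1 ^ strobe_hash(seq[p:p + k]))
        yield i, s1, seq[j:j + k]


def build_index(ref):
    """Sorted parallel lists of seed values and reference positions."""
    entries = sorted((mcs_value(s1, s2), pos) for pos, s1, s2 in randstrobes(ref))
    return [v for v, _ in entries], [p for _, p in entries]


def full_lookup(index, value):
    """Positions whose seed matches on both strobes."""
    values, positions = index
    return positions[bisect.bisect_left(values, value):bisect.bisect_right(values, value)]


def partial_lookup(index, value):
    """Positions whose seed matches on the first strobe only (high bits)."""
    values, positions = index
    start = (value >> AUX_BITS) << AUX_BITS
    lo = bisect.bisect_left(values, start)
    hi = bisect.bisect_left(values, start + (1 << AUX_BITS))
    return positions[lo:hi]
```

A query would typically try `full_lookup` first and fall back to `partial_lookup` when the full seed finds nothing, for example when a mutation falls inside the second strobe. Because both lookups address the same sorted array, the fallback touches neighboring entries rather than a second index.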
