Theory of local k-mer selection with applications to long-read alignment

Jim Shaw, Yun William Yu
DOI: https://doi.org/10.1101/2021.05.22.445262
2021-05-23
Abstract:

Motivation: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers are a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of theoretical understanding of why certain methods perform well.

Results: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers and (α, b, n)-words, as well as an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more optimal k-mer selection method and demonstrate that there is up to an 8.2% relative increase in the number of mapped reads.

Availability and supplementary information: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and is available at https://github.com/bluenote-1577/os-minimap2.

Contact: jshaw@math.toronto.edu
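
To make the minimizer technique mentioned in the abstract concrete, here is a minimal Python sketch. It is an illustration only, not the authors' code or the minimap2 implementation: the function names, the default lexicographic ordering, and the leftmost tie-breaking rule are all assumptions (practical tools typically order k-mers by a hash rather than lexicographically).

```python
def kmers(seq, k):
    """Return all k-mers of seq in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer_positions(seq, k, w, order=None):
    """Start positions of k-mers selected by a sliding-window minimizer scheme.

    Each window covers w consecutive k-mers; the 'lowest-ordered' k-mer in the
    window is selected. Assumptions of this sketch: the ordering defaults to
    lexicographic, and ties are broken by taking the leftmost k-mer.
    """
    key = order if order is not None else (lambda kmer: kmer)
    all_kmers = kmers(seq, k)
    selected = set()
    for start in range(len(all_kmers) - w + 1):
        window = range(start, start + w)
        # Compare (order value, position) so that ties go to the leftmost k-mer.
        best = min(window, key=lambda i: (key(all_kmers[i]), i))
        selected.add(best)
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical toy input; k and w are far smaller than in real aligners.
    print(minimizer_positions("ACGTACGT", k=3, w=2))
```

The selection is local in the sense that whether a k-mer is chosen depends only on the k-mers in the windows containing it, so a mutation can only change selections near the mutated position.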