General encoding of canonical k-mers

Roland Wittler
DOI: https://doi.org/10.1101/2023.03.09.531845
2024-08-22
Abstract: To index or compare sequences efficiently, k-mers, i.e., substrings of fixed length k, are often used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., by applying some bijective mapping between all σ^k possible k-mers and the interval [0, σ^k - 1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σ^k - 1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 - 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space- and time-efficient bit-based implementation for the DNA alphabet.
Bioinformatics
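
To make the problem stated in the abstract concrete, the following Python sketch shows a naive base-σ encoding combined with canonicalization under reverse complementation. It is not the minimal encoding proposed in the paper; it only illustrates why a naive encoding leaves gaps. The alphabet order A < C < G < T and the helper names `encode`, `reverse_complement`, and `canonical` are assumptions made for this illustration.

```python
from itertools import product

# Assumed alphabet order A < C < G < T (an illustration choice, not from the paper).
ALPHABET = "ACGT"
RANK = {c: i for i, c in enumerate(ALPHABET)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer: str) -> int:
    """Naive base-sigma encoding: a bijection from k-mers onto [0, sigma^k - 1]."""
    code = 0
    for c in kmer:
        code = code * len(ALPHABET) + RANK[c]
    return code


def reverse_complement(kmer: str) -> str:
    """Reverse the k-mer and complement every character."""
    return "".join(COMPLEMENT[c] for c in reversed(kmer))


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


k = 3
all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
# Naive codes of all distinct canonical 3-mers.
codes = sorted({encode(canonical(kmer)) for kmer in all_kmers})

# There are 4^3 / 2 = 32 canonical 3-mers, but their naive codes are
# scattered over [0, 63] instead of filling [0, 31].
print(len(codes), max(codes))
```

For k = 3 this yields 32 canonical k-mers whose largest naive code is 52, so a table indexed by the naive code would need more slots than there are canonical k-mers. A minimal encoding, as described in the abstract, avoids these unused slots.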