Abstract: K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at http://github.com/ged-lab/khmer.

What problem does this paper attempt to address?

The problem this paper addresses is efficient k-mer counting in large-scale nucleotide sequence analysis. Specifically, the authors present a method based on a probabilistic data structure, the Count-Min Sketch, for counting k-mer frequencies in sequencing data sets online. The method targets the high memory consumption and slow speed that existing methods face on large data sets.

### Main problems solved by the paper:

1. **Efficient online k-mer counting**: Traditional k-mer counting methods (based on hash tables, suffix arrays, and trie structures) require a large amount of memory on large data sets and are less efficient at updating and retrieving k-mer counts. The paper uses a Count-Min Sketch, which supports online updates and retrievals of k-mer counts in memory.
2. **Memory efficiency**: On sparse data sets, a Count-Min Sketch is more memory efficient than exact data structures. It introduces a systematic overcount, which the paper treats as an acceptable trade-off, especially on large data sets.
3. **Support for real-time applications**: Because counts can be updated and retrieved in memory as data arrives, the Count-Min Sketch supports streaming applications such as digital normalization.

### Main contributions:

- **khmer software package**: The authors developed khmer, a software package that implements k-mer counting based on a Count-Min Sketch. The paper compares khmer with other k-mer counting tools (Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle, and KAnalyze) in terms of memory usage, speed, and disk usage, and reports that it performs well.
- **Performance analysis**: The paper analyzes the speed, memory usage, and miscount rate of khmer when generating k-mer frequency distributions and retrieving counts for individual k-mers, and compares these results with other tools.
- **Application cases**: The paper also examines how khmer performs in specific applications: sequencing error profiling, low-abundance k-mer trimming, and digital normalization.

### Specific implementation:

- **Count-Min Sketch**: The paper describes the implementation of the Count-Min Sketch, including how multiple hash tables of different sizes reduce the effect of collisions and how a k-mer count is obtained by taking the minimum value across the tables.
- **Memory and time efficiency**: The paper analyzes the memory usage and time complexity of the Count-Min Sketch and discusses how to choose the number and size of the hash tables.
- **Miscount rate analysis**: The paper discusses the miscount rate of the Count-Min Sketch and its impact on k-mer counting, and reports experiments indicating that the miscount rate is low and predictable on real data sets.

In conclusion, this paper presents a k-mer counting approach based on the Count-Min Sketch that addresses the problem of efficient k-mer counting in large-scale nucleotide sequence analysis, and it is of both theoretical and practical interest.
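To make the counting scheme summarized above concrete, here is a minimal Python sketch of a Count-Min Sketch applied to k-mers. It is an illustration only and not khmer's implementation: the class name `CountMinSketch`, the helper `kmers`, the BLAKE2b-based hash, and the table sizes are all assumptions chosen for the example. It shows the two properties the text relies on: each k-mer increments one counter per table, and a lookup returns the minimum across tables, so collisions can inflate a count but never reduce it.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch for k-mer counts (illustrative, not khmer's code)."""

    def __init__(self, table_sizes):
        # One counter array per table; using different sizes means two k-mers
        # that collide in one table are unlikely to collide in all of them.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # Assumed hashing scheme: a single 64-bit digest reduced modulo each
        # table size. Only counters are stored, never the k-mers themselves.
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        # Online update: increment one counter in every table.
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        # Collisions only ever add to a counter, so the minimum across tables
        # is the tightest available estimate and is never below the true count.
        return min(table[slot] for table, slot in zip(self.tables, self._slots(kmer)))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# Usage: count 4-mers from two short example reads as they stream in.
sketch = CountMinSketch(table_sizes=[1009, 1013, 1019])
for read in ["ACGTACGTAC", "CGTACGTTTT"]:
    for kmer in kmers(read, k=4):
        sketch.add(kmer)

# "ACGT" occurs 3 times in these reads; the estimate is at least that.
print(sketch.count("ACGT"))
```

Memory use is fixed by the table sizes rather than by the number of distinct k-mers, which is where the saving on sparse data comes from; the cost is that a k-mer whose slots are all shared with other k-mers is overcounted.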

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

KMC 2: Fast and resource-frugal $k$-mer counting

MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

KmerCo: A lightweight K-mer counting technique with a tiny memory footprint

Kmerlight: fast and accurate k-mer abundance estimation

Kmcex: Memory-Frugal and Retrieval-Efficient Encoding of Counted K-Mers.

Hyper-k-mers: efficient streaming k-mers representation

CHTKC: a robust and efficient k-mer counting algorithm based on a lock-free chaining hash table

KCOSS: an ultra-fast k-mer counter for assembled genome analysis

High‐frequency K‐mer Counting at Low Memory Footprint

K-mer Counting: Memory-Efficient Strategy, Parallel Computing and Field of Application for Bioinformatics

High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

TopKmer: Parallel High Frequency K-mer Counting on Distributed Memory

MAFcounter: An efficient tool for counting the occurrences of k-mers in MAF files

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method.

Gerbil: A Fast and Memory-Efficient $k$-mer Counter with GPU-Support

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

KMC 3: counting and manipulating k-mer statistics

Memory-bound k-mer selection for large and evolutionary diverse reference libraries

A Survey of K-mer Methods and Applications in Bioinformatics