These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown
DOI: https://doi.org/10.1371/journal.pone.0101271
2014-07-15
Abstract: K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at http://github.com/ged-lab/khmer.
Genomics
What problem does this paper attempt to address?
This paper addresses the problem of counting k-mers efficiently in large-scale nucleotide sequence analysis. The authors use a probabilistic data structure, the Count-Min Sketch, to count k-mers in sequencing data sets online, meaning that counts can be updated and queried in memory while the data are still being read. The aim is to avoid the high memory consumption and slow speed that existing methods face on large data sets.

### Main problems solved by the paper:

1. **Efficient online k-mer counting**: Exact k-mer counting methods built on hash tables, suffix arrays, or trie structures require a large amount of memory on large data sets and are less efficient at updating and retrieving k-mer counts. A Count-Min Sketch supports efficient online updates and retrievals of k-mer counts in memory.
2. **Memory efficiency**: On sparse data sets, a Count-Min Sketch is more memory-efficient than exact data structures. The trade-off is a systematic overcount, and only the counts, not the k-mers themselves, are stored. The summary treats this error as acceptable, especially for large data sets.
3. **Support for online applications**: Because counts can be updated and retrieved in memory as reads arrive, the Count-Min Sketch supports streaming applications such as digital normalization.

### Main contributions:

- **khmer software package**: The authors developed khmer, a software package that implements k-mer counting on top of a Count-Min Sketch. khmer is compared with other k-mer counting tools (Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle, and KAnalyze) in terms of memory usage, speed, and disk usage, and performs well in that comparison.
- **Performance analysis**: The paper analyzes the speed, memory usage, and miscount rate of khmer when generating k-mer frequency distributions and when retrieving counts for individual k-mers, and compares these results with other tools.
- **Application cases**: The paper also examines how khmer performs in specific applications: profiling sequencing error, trimming low-abundance k-mers, and digital normalization.

### Specific implementation:

- **Count-Min Sketch**: The paper describes the implementation of the Count-Min Sketch, including how multiple hash tables of different sizes reduce the effect of collisions and how a k-mer's count is obtained by taking the minimum value across the tables.
- **Memory and time efficiency**: The paper analyzes the memory usage and time complexity of the Count-Min Sketch and gives guidance on choosing the number and size of the hash tables.
- **Miscount rate analysis**: The paper discusses the miscount rate of the Count-Min Sketch and its impact on k-mer counting, and shows experimentally that the miscount rate is low and predictable on real data sets.

In conclusion, this paper applies a probabilistic data structure, the Count-Min Sketch, to the problem of efficient k-mer counting in large-scale nucleotide sequence analysis, trading a predictable overcount for lower memory use.
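
The counting scheme described under "Specific implementation" (several hash tables of different sizes, with the reported count taken as the minimum across tables) can be illustrated with a minimal Python sketch. This is not khmer's API or code: the class name `CountMinSketch`, the table sizes, the 2-bit k-mer encoding, the reverse-complement canonicalization, and the 8-bit saturating counters are all assumptions made for illustration.

```python
from typing import List

# Hypothetical table sizes (distinct primes), chosen only for illustration.
TABLE_SIZES = [1009, 1013, 1019, 1021]
MAX_COUNT = 255  # assumption: 8-bit counters that saturate instead of wrapping

_ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_to_int(kmer: str) -> int:
    """Encode a k-mer as an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _ENCODE[base]
    return value


def canonical_hash(kmer: str) -> int:
    """Give a k-mer and its reverse complement the same hash value."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer_to_int(kmer), kmer_to_int(revcomp))


class CountMinSketch:
    """Approximate k-mer counter: may overcount, never undercounts
    (until a counter saturates at MAX_COUNT)."""

    def __init__(self, table_sizes: List[int] = TABLE_SIZES) -> None:
        # One counter array per table; a bytearray starts out all zeros.
        self.tables = [bytearray(size) for size in table_sizes]

    def add(self, kmer: str) -> None:
        """Increment the k-mer's slot in every table."""
        h = canonical_hash(kmer)
        for table in self.tables:
            slot = h % len(table)
            if table[slot] < MAX_COUNT:
                table[slot] += 1

    def get(self, kmer: str) -> int:
        """Return the minimum counter across tables for this k-mer."""
        h = canonical_hash(kmer)
        return min(table[h % len(table)] for table in self.tables)

    def consume(self, sequence: str, k: int) -> None:
        """Count every k-mer in a sequence as it is read (online update)."""
        for i in range(len(sequence) - k + 1):
            self.add(sequence[i:i + k])


sketch = CountMinSketch()
sketch.consume("ACGTACGTAC", k=4)
print(sketch.get("ACGT"))  # 2
```

Each k-mer shares its slot in a given table with every other k-mer that maps there, so any single counter can only be too high. Taking the minimum across tables picks the counter least inflated by collisions, which is why the error is a one-sided overcount. Only counters are stored, so the sketch can answer "how often was this k-mer seen?" but cannot list the k-mers it has seen.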