Hyper-k-mers: efficient streaming k-mers representation

Igor Martayan, Lucas Robidou, Yoshihiro Shibuya, Antoine Limasset
DOI: https://doi.org/10.1101/2024.11.06.620789
2024-11-07
Abstract: K-mers have become ubiquitous in modern bioinformatics pipelines. A key factor in their success is the ability to filter out erroneous k-mers by removing those with low abundance. However, the vast number of distinct k-mers makes k-mer counting a significant resource bottleneck. Early tools addressed this issue by storing k-mers on disk. To mitigate the excessive redundancy caused by overlapping k-mers, super-k-mers were introduced, significantly decreasing memory usage. Nevertheless, consecutive super-k-mers still overlap by k − 1 bases, leading to some degree of inefficiency. In this work, we introduce hyper-k-mers, a novel approach that further reduces redundancy. Our contributions are three-fold. First, we propose hyper-k-mers, a new representation of k-mers that asymptotically decreases duplication compared to super-k-mers. Second, we present a theoretical analysis comparing the space efficiency of super-k-mers, syncmers, and hyper-k-mers. Our approach offers significant advantages by reducing the asymptotic lower bound from 6 bits per nucleotide for super-k-mers to 4 bits per k-mer. Third, we present KFC, a k-mer counting algorithm that leverages hyper-k-mers. KFC offers significant practical advantages, including an order of magnitude improvement in memory usage compared to state-of-the-art tools. Notably, our experiments show that KFC is the only tool whose resource usage does not scale linearly with the size of k and is the fastest option for large k-mer sizes. Our tool KFC is open-source under AGPL3 and available at https://github.com/lrobidou/KFC along with the experiment scripts at https://github.com/imartayan/KFC_experiments.
Bioinformatics
What problem does this paper attempt to address?
The paper aims to improve the memory efficiency of k-mer representations, especially for large-scale genomic data. The authors propose a new representation, the hyper-k-mer, that reduces the redundancy remaining in super-k-mers and thereby lowers memory usage further.

### Background and Problem

In modern bioinformatics, k-mers (substrings of length k) are a standard tool for processing DNA sequences. They allow fast estimation of sequence similarity and handle duplicated data naturally. However, the number of distinct k-mers is large enough that storing them becomes very memory-consuming. Early tools addressed this by storing k-mers on disk, which was inefficient. Super-k-mers were introduced later: they group overlapping k-mers into longer sequences, which reduces redundancy, but consecutive super-k-mers still overlap by k − 1 bases, so some inefficiency remains.

### Solution

To further reduce redundancy, the authors propose hyper-k-mers, a k-mer representation that changes how k-mers are grouped so that the overlapping parts of super-k-mers are stored less often. Specifically, hyper-k-mers share the overlapping parts of three consecutive super-k-mers, which significantly reduces memory usage.

### Main Contributions

1. **Hyper-k-mers**: A new k-mer representation that reduces redundancy compared to super-k-mers.
2. **Theoretical analysis**: A comparison of the space efficiency of super-k-mers, syncmers, and hyper-k-mers, showing that hyper-k-mers reduce the asymptotic lower bound from 6 bits per nucleotide for super-k-mers to 4 bits per k-mer.
3. **The KFC algorithm**: A new k-mer counting algorithm built on hyper-k-mers. Experimental results show that KFC outperforms existing tools in both memory usage and running time, especially for large k-mer sizes.

### Experimental Results

The authors evaluated KFC in several experiments. On long-read sequencing data, KFC significantly outperforms other tools in both memory usage and running time, and its advantage is most pronounced when the k-mer size exceeds 200.

### Conclusion

By introducing hyper-k-mers, the authors improve the memory efficiency of k-mer representation and provide a new tool, KFC, that performs well on large-scale genomic data and can support future bioinformatics research.
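To make the redundancy argument concrete, the following is a minimal Python sketch of the two baseline ideas the summary relies on: counting k-mers with a low-abundance filter, and grouping consecutive k-mers into super-k-mers. It does not implement hyper-k-mers and is not KFC's code. All function names are hypothetical, and the lexicographic minimizer ordering is an assumption made for brevity (practical tools typically order m-mers by a hash). The sketch assumes m ≤ k and 2 bits per base.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(seq, k, min_count=1):
    """Count k-mers and drop those seen fewer than min_count times."""
    counts = Counter(kmers(seq, k))
    return {kmer: c for kmer, c in counts.items() if c >= min_count}


def minimizer_pos(kmer, m):
    """Offset of the leftmost lexicographically smallest m-mer (assumed ordering)."""
    return min(range(len(kmer) - m + 1), key=lambda j: (kmer[j:j + m], j))


def super_kmers(seq, k, m):
    """Split seq into maximal runs of consecutive k-mers sharing one minimizer occurrence."""
    if len(seq) < k:
        return []
    result, start, prev = [], 0, None
    for i in range(len(seq) - k + 1):
        pos = i + minimizer_pos(seq[i:i + k], m)  # absolute minimizer position
        if prev is not None and pos != prev:
            # k-mers start..i-1 form one super-k-mer spanning seq[start:i+k-1]
            result.append(seq[start:i + k - 1])
            start = i
        prev = pos
    result.append(seq[start:])
    return result


def bits_per_kmer(chunks, k):
    """Average storage cost per k-mer when each base takes 2 bits."""
    n_bases = sum(len(c) for c in chunks)
    n_kmers = sum(len(c) - k + 1 for c in chunks)
    return 2 * n_bases / n_kmers


seq = "ACGTACGTTGCAACGTTAGC"
k, m = 7, 3
chunks = super_kmers(seq, k, m)
print(chunks)
print("separate k-mers:", bits_per_kmer(kmers(seq, k), k), "bits per k-mer")
print("super-k-mers:   ", bits_per_kmer(chunks, k), "bits per k-mer")
```

In this sketch every pair of consecutive super-k-mers repeats the same k − 1 bases, so a sequence of n bases split into s super-k-mers stores n + (s − 1)(k − 1) bases in total. That repeated overlap is the duplication that hyper-k-mers are designed to reduce.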