Abstract: K-mers have become ubiquitous in modern bioinformatics pipelines. A key factor in their success is the ability to filter out erroneous k-mers by removing those with low abundance. However, the vast number of distinct k-mers makes k-mer counting a significant resource bottleneck. Early tools addressed this issue by storing k-mers on disk. To mitigate the excessive redundancy caused by overlapping k-mers, super-k-mers were introduced, significantly decreasing memory usage. Nevertheless, consecutive super-k-mers still overlap by k − 1 bases, leading to some degree of inefficiency. In this work, we introduce hyper-k-mers, a novel approach that further reduces redundancy. Our contributions are three-fold. First, we propose hyper-k-mers, a new representation of k-mers that asymptotically decreases duplication compared to super-k-mers. Second, we present a theoretical analysis comparing the space efficiency of super-k-mers, syncmers, and hyper-k-mers. Our approach offers significant advantages by reducing the asymptotic lower bound from 6 bits per nucleotide for super-k-mers to 4 bits per k-mer. Third, we present KFC, a k-mer counting algorithm that leverages hyper-k-mers. KFC offers significant practical advantages, including an order of magnitude improvement in memory usage compared to state-of-the-art tools. Notably, our experiments show that KFC is the only tool whose resource usage does not scale linearly with the size of k and is the fastest option for large k-mer sizes. Our tool KFC is open-source under AGPL3 and available at https://github.com/lrobidou/KFC along with the experiment scripts at https://github.com/imartayan/KFC_experiments.
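The abundance filtering mentioned in the abstract can be illustrated with a minimal sketch. It is not KFC's method: it counts k-mers naively in an in-memory dictionary, ignores reverse complements, and the function names `count_kmers` and `filter_low_abundance` are hypothetical, introduced only for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_low_abundance(counts, min_abundance):
    """Keep only k-mers seen at least min_abundance times."""
    return {kmer: c for kmer, c in counts.items() if c >= min_abundance}


# Toy input: the second read differs from the first only in its last base,
# mimicking a sequencing error that creates a rare k-mer.
reads = ["ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, 4)
solid = filter_low_abundance(counts, 2)
print(sorted(solid))
```

Because every distinct k-mer is a separate dictionary key, memory grows with the number of distinct k-mers, which is the bottleneck the abstract describes.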

What problem does this paper attempt to address?

The paper aims to improve the memory efficiency of k-mer representations, especially for large-scale genomic data. The authors propose a new representation, the hyper-k-mer, which reduces the redundancy that remains in super-k-mers and thereby lowers memory usage further.

### Background and Problem

In modern bioinformatics, k-mers (substrings of length k) are a standard tool for processing DNA sequences. Their main advantages are that they allow fast estimation of sequence similarity and handle duplicate data naturally. However, the number of distinct k-mers is very large, so storing them consumes a great deal of memory. Early methods addressed this by storing k-mers on disk, which was inefficient. Super-k-mers, formed by merging overlapping k-mers, later reduced this redundancy, but some inefficiency remains because consecutive super-k-mers still overlap by k − 1 bases.

### Solution

To reduce redundancy further, the authors propose hyper-k-mers, a new k-mer representation that reduces the overlap between super-k-mers by redefining how k-mers are combined. Specifically, hyper-k-mers share the overlapping parts of three consecutive super-k-mers, which significantly reduces memory usage.

### Main Contributions

1. **Hyper-k-mers**: A new k-mer representation that reduces redundancy compared to super-k-mers.
2. **Theoretical analysis**: A comparison of the space efficiency of super-k-mers, syncmers, and hyper-k-mers, showing that the asymptotic lower bound drops from 6 bits per nucleotide for super-k-mers to 4 bits per k-mer.
3. **The KFC algorithm**: A new k-mer counting algorithm built on hyper-k-mers. Experimental results show that KFC outperforms existing tools in both memory usage and running time, especially for large k-mer sizes.

### Experimental Results

The authors evaluated KFC in multiple experiments. On long-read sequencing data, KFC is significantly better than other tools in both memory usage and running time, and its advantage becomes more pronounced when the k-mer size exceeds 200.

### Conclusion

By introducing hyper-k-mers, the authors improve the memory efficiency of k-mer representation and provide a new tool, KFC, that performs well on large-scale genomic data, offering useful technical support for future bioinformatics research.
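The k − 1 base overlap between consecutive super-k-mers can be made concrete with a small sketch. It assumes a simplified definition: the minimizer of a k-mer is its leftmost lexicographically smallest m-mer, and a super-k-mer is a maximal run of consecutive k-mers sharing the same minimizer occurrence. Real tools typically use a hashed ordering and canonical k-mers; the names `minimizer_pos` and `super_kmers` are hypothetical. The sketch covers super-k-mers only and does not implement hyper-k-mers.

```python
def minimizer_pos(seq, start, k, m):
    """Position in seq of the leftmost smallest m-mer of the k-mer at start."""
    return min(range(start, start + k - m + 1), key=lambda j: (seq[j:j + m], j))


def super_kmers(seq, k, m):
    """Split seq into maximal runs of k-mers sharing a minimizer occurrence."""
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return []
    result = []
    run_start = 0
    current = minimizer_pos(seq, 0, k, m)
    for i in range(1, n_kmers):
        pos = minimizer_pos(seq, i, k, m)
        if pos != current:
            # The run ends with the k-mer at i - 1, which spans up to i - 1 + k.
            result.append(seq[run_start:i - 1 + k])
            run_start, current = i, pos
    result.append(seq[run_start:])
    return result


seq = "ACGTTGCATGCAAGCTTAGC"
k, m = 5, 2
sks = super_kmers(seq, k, m)
stored = sum(len(s) for s in sks)
print(f"{len(sks)} super-k-mers, {stored} bases stored for {len(seq)} input bases")
```

Each boundary between two super-k-mers repeats k − 1 bases, so the total stored length is the sequence length plus (number of super-k-mers − 1) × (k − 1). This repeated material is the redundancy that hyper-k-mers are designed to reduce.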

Hyper-k-mers: efficient streaming k-mers representation

Kmcex: Memory-Frugal and Retrieval-Efficient Encoding of Counted K-Mers.

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

KMC 2: Fast and resource-frugal $k$-mer counting

KmerCo: A lightweight K-mer counting technique with a tiny memory footprint

Space-efficient representation of genomic k-mer count tables

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

Space-efficient computation of k-mer dictionaries for large values of k

Kmerlight: fast and accurate k-mer abundance estimation

Brisk: Exact resource-efficient dictionary for k-mers

Memory-bound k-mer selection for large and evolutionary diverse reference libraries

TopKmer: Parallel High Frequency K-mer Counting on Distributed Memory

Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method

KCOSS: an ultra-fast k-mer counter for assembled genome analysis

High‐frequency K‐mer Counting at Low Memory Footprint

A Survey of K-mer Methods and Applications in Bioinformatics

K-mer Counting: Memory-Efficient Strategy, Parallel Computing and Field of Application for Bioinformatics

Efficient Mining Closed K-Mers from DNA and Protein Sequences

High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

Gerbil: A Fast and Memory-Efficient $k$-mer Counter with GPU-Support