Abstract: k-mer profiling has become one of the most widely used approaches for analyzing read data generated by high-throughput sequencing technologies. Its tasks include, but are not limited to, counting the frequencies and determining the occurrences of short sequences in a dataset. The notion of a k-mer has been used extensively to build de Bruijn graphs in genome and transcriptome assembly, which requires examining all possible k-mers present in the dataset. Recently, an alternative way of profiling has been proposed, which constructs a set of representative k-mers as genomic markers and profiles their occurrences in the sequencing data. This technique has been applied to both transcript quantification from RNA-Seq and taxonomic classification of metagenomic reads. Most of these applications use a set of fixed-size k-mers, since the majority of existing k-mer counters cannot process genomic sequences with variable-length k-mers. However, choosing an appropriate k is challenging, as the best value varies across applications. As a pioneering work on profiling a set of variable-length k-mers, we propose TahcoRoll, which enhances the Aho-Corasick algorithm. More specifically, we use one bit to represent each nucleotide and integrate the rolling hash technique to construct an efficient in-memory data structure for this task. On both synthetic and real datasets, TahcoRoll outperforms existing approaches in time efficiency, memory efficiency, or both, without using any disk space. In addition, in a comparison with the most efficient state-of-the-art k-mer counters, such as KMC and MSBWT, TahcoRoll is the only approach that can process long read data from both PacBio and Oxford Nanopore on a commodity desktop computer. TahcoRoll is implemented in C++14, and its source code is available at <https://github.com/chelseaju/TahcoRoll.git>.
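
To make the idea of a one-bit encoding combined with a rolling hash more concrete, the following is a minimal Python sketch, not TahcoRoll's C++ implementation. It rests on several assumptions that do not come from the abstract: the bit mapping in `BIT` (A and G to 0, C and T to 1) is chosen only for illustration, the function names `encode`, `build_index`, and `profile` are hypothetical, and the Aho-Corasick automaton is replaced by one plain hash table per k-mer length to keep the example short. Because one bit per nucleotide is lossy, distinct k-mers can share a code, so every candidate hit is verified against the read.

```python
from collections import defaultdict

# Assumed 1-bit mapping, for illustration only; the abstract does not
# say which nucleotides share a bit value.
BIT = {"A": 0, "G": 0, "C": 1, "T": 1}


def encode(kmer):
    """Pack a k-mer into an integer using one bit per nucleotide."""
    value = 0
    for base in kmer:
        value = (value << 1) | BIT[base]
    return value


def build_index(signatures):
    """Group signature k-mers by length, then by their 1-bit code."""
    index = defaultdict(lambda: defaultdict(list))
    for kmer in signatures:
        index[len(kmer)][encode(kmer)].append(kmer)
    return index


def profile(read, signatures):
    """Count occurrences of each variable-length signature in a read.

    Assumes the read contains only A, C, G, and T.
    """
    unique = set(signatures)
    index = build_index(unique)
    counts = {kmer: 0 for kmer in unique}
    for k, table in index.items():
        mask = (1 << k) - 1  # keeps only the last k bits
        code = 0
        for i, base in enumerate(read):
            # Rolling update: shift in the new base, drop the oldest one.
            code = ((code << 1) | BIT[base]) & mask
            if i + 1 < k:
                continue  # window not yet full
            window = read[i + 1 - k : i + 1]
            # The 1-bit code is lossy, so confirm each candidate exactly.
            for kmer in table.get(code, ()):
                if kmer == window:
                    counts[kmer] += 1
    return counts
```

For example, `profile("ACGTACGT", ["ACG", "CGTA", "TT"])` returns counts of 2, 1, and 0 for the three signatures. Under the assumed mapping, `ACG` and `GTA` share the same code, which is why the exact comparison against the read window is needed. This sketch scans the read once per distinct signature length, whereas an automaton-based design such as the one the abstract describes aims to handle all lengths together.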

CHTKC: a robust and efficient k-mer counting algorithm based on a lock-free chaining hash table

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

KCOSS: an ultra-fast k-mer counter for assembled genome analysis

Kmcex: Memory-Frugal and Retrieval-Efficient Encoding of Counted K-Mers.

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

High‐frequency K‐mer Counting at Low Memory Footprint

Efficient Mining Closed K-Mers from DNA and Protein Sequences

Algorithms for Biological Sequence K-mer Frequency Counting Problem

KMC 2: Fast and resource-frugal $k$-mer counting

MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

TopKmer: Parallel High Frequency K-mer Counting on Distributed Memory

Research on Counting Algorithm of K-Mer Occurrence in DNA Sequence

Space-efficient computation of k-mer dictionaries for large values of k

KmerCo: A lightweight K-mer counting technique with a tiny memory footprint

Seeding with Minimized Subsequence.

K-mer Counting: Memory-Efficient Strategy, Parallel Computing and Field of Application for Bioinformatics

TahcoRoll: An Efficient Approach for Signature Profiling in Genomic Data through Variable-Length k-mers

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method.

Efficient Seeding for Error-Prone Sequences with SubseqHash2

Hyper-k-mers: efficient streaming k-mers representation