Abstract: k-mer profiling has become one of the most widely used approaches for analyzing read data generated by high-throughput sequencing technologies. The tasks of k-mer profiling include, but are not limited to, counting the frequencies and determining the occurrences of short sequences in a dataset. The notion of k-mers has been used extensively to build de Bruijn graphs in genome or transcriptome assembly, which requires examining all possible k-mers present in the dataset. Recently, an alternative way of profiling has been proposed, which constructs a set of representative k-mers as genomic markers and profiles their occurrences in the sequencing data. This technique has been applied in both transcript quantification through RNA-Seq and taxonomic classification of metagenomic reads. Most of these applications use a set of fixed-size k-mers, since the majority of existing k-mer counters cannot adequately process genomic sequences with variable-length k-mers. However, choosing the appropriate k is challenging, as it varies across applications. As a pioneering effort to profile a set of variable-length k-mers, we propose TahcoRoll, which enhances the Aho-Corasick algorithm. More specifically, we use one bit to represent each nucleotide and integrate the rolling hash technique to construct an efficient in-memory data structure for this task. On both synthetic and real datasets, our results show that TahcoRoll outperforms existing approaches in time efficiency, memory efficiency, or both, without using any disk space. In addition, compared to the most efficient state-of-the-art k-mer counters, such as KMC and MSBWT, TahcoRoll is the only approach that can process long read data from both PacBio and Oxford Nanopore on a commodity desktop computer. TahcoRoll is implemented in C++14, and its source code is available at <https://github.com/chelseaju/TahcoRoll.git>.
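To make the idea of combining a one-bit nucleotide encoding with a rolling hash more concrete, the following is a minimal Python sketch of profiling variable-length signatures in reads. It is not the TahcoRoll implementation and does not build an Aho-Corasick automaton: it simply keeps one hash table per signature length and rolls a bit window across each read. The bit assignment (A and T to 0, C and G to 1), the function name `profile_signatures`, and the restriction to non-empty, uppercase A/C/G/T input are all assumptions made for illustration. Because a one-bit code is lossy, every hash hit is verified against the actual substring.

```python
from collections import defaultdict

# Hypothetical one-bit code; the actual grouping of bases is an assumption.
ONE_BIT = {"A": 0, "T": 0, "C": 1, "G": 1}


def encode(seq):
    """Map an uppercase A/C/G/T string to a list of bits."""
    return [ONE_BIT[base] for base in seq]


def profile_signatures(signatures, reads):
    """Count occurrences of each (variable-length) signature in the reads.

    Simplified stand-in for the approach described above: one hash table
    per signature length instead of an Aho-Corasick automaton.
    """
    counts = dict.fromkeys(signatures, 0)

    # length -> bit hash -> signatures sharing that hash
    tables = defaultdict(lambda: defaultdict(list))
    for sig in counts:
        h = 0
        for bit in encode(sig):
            h = (h << 1) | bit
        tables[len(sig)][h].append(sig)

    for read in reads:
        bits = encode(read)
        for k, table in tables.items():
            mask = (1 << k) - 1  # keep only the last k bits
            h = 0
            for i, bit in enumerate(bits):
                # Rolling update: shift in the new bit, drop the oldest one.
                h = ((h << 1) | bit) & mask
                if i < k - 1:
                    continue  # window not yet full
                start = i - k + 1
                for sig in table.get(h, ()):
                    # The one-bit hash is lossy, so confirm the real match.
                    if read[start:i + 1] == sig:
                        counts[sig] += 1
    return counts
```

For example, `profile_signatures(["ACG", "CGTA", "TT"], ["ACGTACG", "TTTT"])` returns a dictionary of per-signature counts, with signatures of length 2, 3, and 4 handled in the same pass over each read. The sketch scans each read once per distinct signature length, which is the kind of repeated work a single automaton over all signatures is meant to avoid.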

TahcoRoll: An Efficient Approach for Signature Profiling in Genomic Data through Variable-Length k-mers