Abstract:

Background: Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently, many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of environments have been generated. Assembling these reads can be difficult, and analysis methods based on mapping reads to genes or pathways are restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, need neither complete genomes nor existing databases and thus can potentially be very useful for comparing metagenomic samples using NGS read data. Still, the application of sequence signature methods to the comparison of metagenomic samples has not been well studied.

Results: We studied several dissimilarity measures for the comparison of metagenomic samples using sequence signatures: $d_2$, $d_2^*$ and $d_2^S$, recently developed by our group; a measure (hereinafter denoted Hao) used in CVTree, developed by Hao's group (Qi et al., 2004); measures based on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009); and standard $l_p$ measures between the frequency vectors. We compared their performance using a series of extensive simulations and three real NGS metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples from across the world, and 13 fecal samples from human individuals. The dissimilarity measure $d_2^S$ achieved superior performance when comparing metagenomic samples, both in clustering them into different groups and in recovering environmental gradients affecting microbial samples. The analyses also yield new insights into the environmental factors affecting microbial compositions in metagenomic samples. Our results show that sequence signatures of the mammalian gut are closely associated with the diet and gut physiology of the mammals, and that sequence signatures of marine communities are closely related to location and temperature.

Conclusions: Sequence signatures can successfully reveal major group and gradient relationships among metagenomic samples from NGS reads without alignment to reference databases. The $d_2^S$ dissimilarity measure is a good choice in all application scenarios. The optimal choice of tuple size depends on sequencing depth, but it is quite robust within a range of choices for moderate sequencing depths.
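To make the idea of a sequence signature concrete, the following is a minimal Python sketch of the simplest measure mentioned above, an l_p distance between k-tuple frequency vectors pooled over all reads in a sample. It is an illustration under stated assumptions, not the authors' implementation: the function names `kmer_frequencies` and `lp_distance` and the toy reads are hypothetical, tuples containing characters other than A, C, G, T are skipped, and reverse complements are not merged. The d2, d2* and d2S measures are not shown here, since the latter two additionally require a background sequence model that the abstract does not specify.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k):
    """Relative frequencies of all 4**k DNA k-tuples, pooled over the reads.

    Hypothetical helper for illustration; windows containing anything
    other than A, C, G, T (for example N) are ignored.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    # Fixed lexicographic ordering so vectors from different samples align.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[w] / total for w in kmers]


def lp_distance(f, g, p=2):
    """Standard l_p distance between two equal-length frequency vectors."""
    return sum(abs(a - b) ** p for a, b in zip(f, g)) ** (1.0 / p)


# Toy "samples": each is just a list of short reads (made-up data).
sample_a = ["ACGTACGT", "TTGACA"]
sample_b = ["ACGTTTTT", "GGGACA"]

fa = kmer_frequencies(sample_a, k=2)
fb = kmer_frequencies(sample_b, k=2)
print(lp_distance(fa, fb, p=1), lp_distance(fa, fb, p=2))
```

Because the signature is computed directly from reads, no assembly or reference database is involved; a pairwise matrix of such distances over many samples could then be passed to any standard clustering or ordination method.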

Effect of k-tuple length on sample-comparison with high-throughput sequencing data

Comparison of Microbial Diversity Determined with the Same Variable Tag Sequence Extracted from Two Different PCR Amplicons

Comparison of Metatranscriptomic Samples Based on K-Tuple Frequencies

Comparison of Metagenomic Samples Using Sequence Signatures

Multiple Alignment-Free Sequence Comparison

Multiple Comparative Metagenomics using Multiset k-mer Counting

OTU Analysis Using Metagenomic Shotgun Sequencing Data

Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains

Comparative analysis of metagenomic classifiers for long-read sequencing datasets

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method

Alignment-Free Sequence Analysis and Applications

Alignment-Free Sequence Comparison Based on Next Generation Sequencing Reads: Extended Abstract

New Developments Of Alignment-Free Sequence Comparison: Measures, Statistics And Next-Generation Sequencing

Exploration and retrieval of whole-metagenome sequencing samples

Computational Methods for the Analysis of Tag Sequences in Metagenomics Studies

TahcoRoll: An Efficient Approach for Signature Profiling in Genomic Data through Variable-Length k-mers

New Features or Metric on Sequence Comparison

Estimating the total genome length of a metagenomic sample using k-mers

A Benchmark of Genetic Variant Calling Pipelines Using Metagenomic Short-Read Sequencing

Alignment-free sequence comparison based on next-generation sequencing reads

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase