Abstract:

Background: Our understanding of the composition, function, and health implications of human microbiota has been advanced by high-throughput sequencing and the development of new genomic analyses. However, trade-offs among alternative strategies for the acquisition and analysis of sequence data remain understudied.

Methods: We assessed eight popular taxonomic profiling pipelines (MetaPhlAn2, metaMix, PathoScope 2.0, Sigma, Kraken, ConStrains, Centrifuge and Taxator-tk) against a battery of metagenomic datasets simulated from real data. The metagenomic datasets were modeled on 426 complete or permanent draft genomes stored in the Human Oral Microbiome Database and were designed to simulate various experimental conditions, both in the design of a putative experiment (read length, 75–1,000 bp; sequencing depth, 100K–10M) and in metagenomic composition (number of species present, 10, 100 or 426; species distribution). The sensitivity and specificity of each pipeline were measured under these scenarios. We also estimated the relative root mean square error and the average relative error to assess the abundance estimates produced by the different methods. Additional datasets were generated for five of the pipelines to simulate the presence within a metagenome of an unreferenced species closely related to other, referenced species. Further datasets were generated to measure computational time at increasing sequencing depth (up to 6 × 10^7).

Results: Testing of eight pipelines against 144 simulated metagenomic datasets initially produced 1,104 discrete results. Pipelines using a marker gene strategy, MetaPhlAn2 and ConStrains, were overall less sensitive than other pipelines, with the notable exception of Taxator-tk. This lower sensitivity was largely offset by runtime, which was significantly lower than that of more sensitive pipelines that rely on whole-genome alignments, such as PathoScope 2.0.
However, pipelines that used strategies to speed up alignment between genomic references and metagenomic reads, such as kmerization, were able to combine high sensitivity with low run time, as is the case with Kraken and Centrifuge. In all pipelines, when a species genome was absent from the database, reads were mostly assigned to the most closely related species available. Our results therefore suggest that taxonomic profilers that use kmerization have largely superseded those that use gene markers, coupling low run times with high sensitivity and specificity. Taxonomic profilers using more time-consuming read reassignment, such as PathoScope 2.0, provided the most sensitive profiles under common metagenomic sequencing scenarios. All the results described and discussed in this paper can be visualized using the dedicated R Shiny application (https://github.com/microgenomics/HumanMicrobiomeAnalysis). All of our datasets, pipelines and results are made available through the GitHub repository for future benchmarking.
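
The abstract names four evaluation measures (sensitivity, specificity, relative root mean square error and average relative error) without giving formulas. The Python sketch below shows one common way such measures could be computed for a single simulated dataset. It is an illustration under assumptions, not the authors' implementation: the function names, the choice to score specificity over the reference species not in the sample, and the choice to average relative errors only over species truly present are all hypothetical.

```python
from math import sqrt


def detection_metrics(truth, predicted, reference):
    """Sensitivity and specificity of species detection (assumed definitions).

    truth:     species actually present in the simulated metagenome
    predicted: species reported by the profiler
    reference: all species in the reference database
    """
    truth, predicted, reference = set(truth), set(predicted), set(reference)
    tp = len(truth & predicted)                 # present and reported
    fn = len(truth - predicted)                 # present but missed
    fp = len((predicted - truth) & reference)   # reported but absent
    tn = len(reference - truth - predicted)     # absent and not reported
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


def abundance_errors(true_abund, est_abund):
    """Relative RMSE and average relative error (assumed definitions).

    Both are computed over species with non-zero true abundance;
    a species the profiler missed counts as an estimate of 0.
    """
    rel = [
        (est_abund.get(species, 0.0) - t) / t
        for species, t in true_abund.items()
        if t > 0
    ]
    rrmse = sqrt(sum(r * r for r in rel) / len(rel))
    avg_rel_err = sum(abs(r) for r in rel) / len(rel)
    return rrmse, avg_rel_err


if __name__ == "__main__":
    # Toy example with made-up species labels and abundances.
    reference = ["A", "B", "C", "D", "E"]
    true_abund = {"A": 0.5, "B": 0.25, "C": 0.25}
    est_abund = {"A": 0.5, "B": 0.125, "D": 0.375}

    sens, spec = detection_metrics(true_abund, est_abund, reference)
    rrmse, are = abundance_errors(true_abund, est_abund)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
    print(f"RRMSE={rrmse:.3f} average relative error={are:.3f}")
```

In this toy case the profiler finds two of the three species present and reports one absent species, so sensitivity is 2/3 and specificity is 1/2 over the two reference species that are truly absent. Other definitions are possible; for example, errors could be averaged over the whole reference set rather than over species truly present.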

Assessing 16S marker gene survey data analysis methods using mixtures of human stool sample DNA extracts.

A Framework for Assessing 16S rRNA Marker-Gene Survey Data Analysis Methods Using Mixtures.

Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline

Evaluation of computational methods for human microbiome analysis using simulated data

Sampling and pyrosequencing methods for characterizing bacterial communities in the human gut using 16S sequence tags

Benchmarking of 16S rRNA gene databases using known strain sequences

Multi-amplicon microbiome data analysis pipelines for mixed orientation sequences using QIIME2: Assessing reference database, variable region and pre-processing bias in classification of mock bacterial community samples

Assessment of statistical methods from single cell, bulk RNA-seq, and metagenomics applied to microbiome data

Comparing bioinformatic pipelines for microbial 16S rRNA amplicon sequencing

A multi-amplicon 16S rRNA sequencing and analysis method for improved taxonomic profiling of bacterial communities

Evaluating the accuracy of amplicon-based microbiome computational pipelines on simulated human gut microbial communities

High Throughput and Quantitative Measurement of Microbial Metabolome by Gas Chromatography/Mass Spectrometry Using Automated Alkyl Chloroformate Derivatization.

Impact of sequence variant detection and bacterial DNA extraction methods on the measurement of microbial community composition in human stool

Accurate quantitation of 16S gene copies in low biomass samples post-antibiotic treatment through deep sequencing with a balanced nucleotide synthetic spike-in approach

Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments

Toward Standards in Clinical Microbiota Studies: Comparison of Three DNA Extraction Methods and Two Bioinformatic Pipelines

Primer, Pipelines, Parameters: Issues in 16S rRNA Gene Sequencing

An independent evaluation in a CRC patient cohort of microbiome 16S rRNA sequence analysis methods: OTU clustering, DADA2, and Deblur

Optimization of the 16S rRNA Sequencing Analysis Pipeline for Studying in Vitro Communities of Gut Commensals.

OTU Analysis Using Metagenomic Shotgun Sequencing Data

Assessing the Fecal Microbiota: An Optimized Ion Torrent 16S rRNA Gene-Based Analysis Protocol