Abstract: One of the major methods to identify microbial community composition, to unravel microbial population dynamics, and to explore microbial diversity in environmental samples is high-throughput DNA- or RNA-based 16S rRNA (gene) amplicon sequencing in combination with bioinformatics analyses. However, for environmental samples from contrasting habitats, it has not been systematically evaluated (i) which analysis methods provide results that reflect reality most accurately, (ii) how the interpretations of microbial community studies are biased by different analysis methods, and (iii) whether the optimal analysis workflow can be implemented in an easy-to-use pipeline. Here, we compared the performance of 16S rRNA (gene) amplicon sequencing analysis tools (i.e., Mothur, QIIME1, QIIME2, and MEGAN) using three mock datasets with known microbial community composition that differed in sequencing quality, species number and abundance distribution (i.e., even or uneven), and phylogenetic diversity (i.e., closely related or well-separated amplicon sequences). Our results showed that QIIME2 outperformed all other investigated tools in sequence recovery (>10 times fewer false positives), taxonomic assignments (>22% better F-score), and diversity estimates (>5% better assessment), suggesting that this approach reflects the in situ microbial community most accurately. Further analysis of 24 environmental datasets obtained from four contrasting terrestrial and freshwater sites revealed dramatic differences between pipelines in the resulting microbial community composition at genus level. For instance, at the investigated river water sites, Sphaerotilus was only reported when using QIIME1 (8% abundance) and Agitococcus only with QIIME1 or QIIME2 (2 or 3% abundance, respectively), but both genera remained undetected when analyzed with Mothur or MEGAN. Since these abundant taxa probably have implications for important biogeochemical cycles (e.g., nitrate and sulfate reduction) at these sites, their detection and semi-quantitative enumeration are crucial for valid interpretations. A high-performance computing conformant workflow was constructed to allow FAIR (Findable, Accessible, Interoperable, and Re-usable) 16S rRNA (gene) amplicon sequence analysis starting from raw sequence files, using the best-performing methods identified in our study. The presented workflow should be considered for future studies, as it substantially facilitates the analysis of high-throughput 16S rRNA (gene) sequencing data while maximizing reliability and confidence in microbial community data analysis.
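
The abstract compares tools by false positives and F-score on mock communities of known composition. As a minimal sketch of how such a score could be computed, the Python snippet below assumes the standard F1 definition (the harmonic mean of precision and recall); the abstract does not state the exact formula used in the study. The function name `f_score` and the genus lists are hypothetical and serve only as an illustration.

```python
def f_score(expected, observed):
    """Return (precision, recall, F1) for taxa reported by a pipeline.

    expected: taxa known to be present in the mock community
    observed: taxa reported by the analysis tool
    Assumption: standard F1; the study's exact definition may differ.
    """
    expected, observed = set(expected), set(observed)
    true_pos = len(expected & observed)    # correctly reported taxa
    false_pos = len(observed - expected)   # reported but not in the mock
    false_neg = len(expected - observed)   # in the mock but missed

    precision = true_pos / (true_pos + false_pos) if observed else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical mock community and pipeline output (illustration only).
mock_genera = ["Escherichia", "Bacillus", "Pseudomonas", "Staphylococcus"]
reported_genera = ["Escherichia", "Bacillus", "Pseudomonas", "Sphaerotilus"]

precision, recall, f1 = f_score(mock_genera, reported_genera)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case one genus is missed and one is reported spuriously, so precision, recall, and F1 all equal 0.75. A tool with fewer false positives raises precision, and with it the F-score, even when recall is unchanged.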

A read-filtering algorithm for high-throughput marker-gene studies that greatly improves OTU accuracy

Subsampled Open-Reference Clustering Creates Consistent, Comprehensive OTU Definitions and Scales to Billions of Sequences

Two-Stage Clustering (TSC): A Pipeline for Selecting Operational Taxonomic Units for the High-Throughput Sequencing of PCR Amplicons

Minimizing spurious features in 16S rRNA gene amplicon sequencing

From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data

bioOTU: An Improved Method for Simultaneous Taxonomic Assignments and Operational Taxonomic Units Clustering of 16S rRNA Gene Sequences

Error filtering, pair assembly and error correction for next-generation sequencing reads

Filtering ASVs/OTUs via mutual information-based microbiome network analysis

A novel ultra high-throughput 16S rRNA gene amplicon sequencing library preparation method for the Illumina HiSeq platform

Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline

Hybrid-denovo: a de novo OTU-picking pipeline integrating single-end and paired-end 16S sequence tags

Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes

miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data

Efficient Frequency-Based De Novo Short-Read Clustering for Error Trimming in Next-Generation Sequencing

Next-generation data filtering in the genomics era

Turn ‘noise’ to signal: accurately rectify millions of erroneous short reads through graph learning on edit distances

The effect of low-abundance OTU filtering methods on the reliability and variability of microbial composition assessed by 16S rRNA amplicon sequencing

SimpleMetaPipeline: Breaking the bioinformatics bottleneck in metabarcoding

Fast and Simple Analysis of MiSeq Amplicon Sequencing Data with MetaAmp

Sequencing Introduced False Positive Rare Taxa Lead to Biased Microbial Community Diversity, Assembly, and Interaction Interpretation in Amplicon Studies

MeFiT: merging and filtering tool for Illumina paired-end reads for 16S rRNA amplicon sequencing