Abstract: Inaccurate bacterial taxonomic assignment in 16S-based microbiota experiments can have deleterious effects on research results, because all downstream analyses rely heavily on an accurate assessment of microbial taxonomy: a bias in the choice of the reference database can deeply alter microbiota biodiversity (alpha-diversity), composition (beta-diversity), and taxa profile (bacterial relative abundances). In this paper, we explored the influence of the reference 16S rRNA collection by performing a classification against four of the main databases used by the scientific community (i.e., Greengenes, SILVA, RDP, and NCBI); the consequences of clustering the databases at 97% were also explored. To investigate the effects of the database choice on real and representative microbiome samples from different ecosystems, we performed a comparative analysis on four already published datasets from various sources: stools from a mouse model experiment, bovine milk, human gut microbiota stool samples, and swabs from the human vaginal environment. We also took into consideration the computational time needed to perform the taxonomic classification. Although alpha- and beta-diversity values varied considerably, sometimes to a statistically significant extent, according to the dataset chosen and whether clustering was applied, the analyses were ultimately concordant in their ability to recover the original experimental group differences across the various datasets. In the taxonomic classification, however, we found several inconsistencies, with some taxonomies correctly assigned by only some of the four databases. The degree of concordance among the databases was related both to the complexity of the environment and to how completely that environment is represented in the reference databases.

IMPORTANCE 16S rRNA sequencing is currently the most commonly used strategy for microbiota profiling in many different ecosystems, ranging from human-associated samples to animal models, food matrices, and environmental samples. The ability of this kind of analysis to correctly capture differences in microbiota composition depends on the taxonomic classification of the fragments obtained from sequencing and, thus, on the choice of the best reference database. This paper deals with four of the most popular microbial databases, which were evaluated for their ability to reproduce the experimental evidence from four already published datasets. Knowing the advantages and drawbacks of each database choice can be pivotal for planning future experiments in the field, making researchers aware of the repercussions of such a choice for the different environments under scrutiny. Moreover, this work can also shed new light on past results, partially explaining discordant evidence.
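
To make the three affected quantities concrete, the following minimal sketch shows how one and the same sample can yield different relative abundances, alpha-diversity, and beta-diversity when its reads are classified against two reference databases. It is an illustration only, not the pipeline used in the paper: the read counts, the taxa, and the names `profile_db_a` and `profile_db_b` are invented, the Shannon index is assumed as the alpha-diversity measure, and Bray-Curtis dissimilarity is assumed as the beta-diversity measure.

```python
import math


def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def shannon(counts):
    """Shannon index (natural log), one common alpha-diversity measure."""
    fractions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in fractions if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two profiles (0 = identical, 1 = disjoint).

    Computed on relative abundances, where it reduces to
    1 - sum over taxa of min(p_a, p_b).
    """
    pa, pb = relative_abundance(counts_a), relative_abundance(counts_b)
    taxa = set(pa) | set(pb)
    shared = sum(min(pa.get(t, 0.0), pb.get(t, 0.0)) for t in taxa)
    return 1.0 - shared


# Hypothetical genus-level read counts for ONE sample, classified twice.
# Database A resolves three genera; database B leaves the reads of two of
# them unclassified. All numbers are invented for illustration.
profile_db_a = {"Lactobacillus": 60, "Gardnerella": 30, "Prevotella": 10}
profile_db_b = {"Lactobacillus": 60, "Unclassified": 40}

print("Shannon (database A):", round(shannon(profile_db_a), 3))
print("Shannon (database B):", round(shannon(profile_db_b), 3))
print("Bray-Curtis A vs B:  ", round(bray_curtis(profile_db_a, profile_db_b), 3))
```

In this toy case the underlying reads are identical, yet the profile from the hypothetical database B appears less diverse and sits at a non-zero distance from the profile obtained with database A. This is the kind of database-driven shift the abstract describes.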

A comparison between Greengenes, SILVA, RDP, and NCBI reference databases in four published microbiota datasets
