A comparison between Greengenes, SILVA, RDP, and NCBI reference databases in four published microbiota datasets
M. Severgnini, Camilla Ceccarani
DOI: https://doi.org/10.1101/2023.04.12.535864
2023-04-13
bioRxiv
Abstract: Inaccurate bacterial taxonomic assignment in 16S-based microbiota experiments can have deleterious effects on research results, because all downstream analyses rely heavily on an accurate assessment of microbial taxonomy: a bias in the choice of the reference database can deeply alter microbiota biodiversity (alpha-diversity), composition (beta-diversity), and taxa profile (bacterial relative abundances). In this paper, we explored the influence of the reference 16S rRNA collection by performing a classification against four of the main databases used by the scientific community (i.e., Greengenes, SILVA, RDP, and NCBI); the consequences of clustering the databases at 97% were also explored. To investigate the effects of the database choice on real and representative microbiome samples from different ecosystems, we performed a comparative analysis on four previously published datasets from various sources: stools from a mouse model experiment, bovine milk, human gut microbiota stool samples, and swabs from the human vaginal environment. We also took into account the computational time needed to perform the taxonomic classification. Although alpha- and beta-diversity values varied considerably, in some cases with statistical significance, depending on the dataset and on whether the database was clustered, the analyses were ultimately concordant in their ability to recover the original differences between experimental groups across the datasets. In the taxonomic classification, however, we found several inconsistencies, with some taxonomies correctly assigned in only some of the four databases. The degree of concordance among the databases was related to both the complexity of the environment and how completely that environment is represented in the reference databases.
IMPORTANCE: 16S rRNA sequencing is currently the most commonly used strategy for microbiota profiling in many different ecosystems, ranging from human-associated samples to animal models, food matrices, and environmental samples. The ability of this kind of analysis to correctly capture differences in microbiota composition depends on the taxonomic classification of the sequenced fragments and, thus, on the choice of the most suitable reference database. This paper evaluates four of the most popular microbial databases on their ability to reproduce the experimental evidence from four previously published datasets. Knowing the advantages and drawbacks of each database can be pivotal when planning future experiments in the field, making researchers aware of the repercussions of that choice in the different environments under scrutiny. Moreover, this work can shed new light on past results, partially explaining discordant evidence.
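To make concrete how the choice of reference database can shift diversity metrics, the following is a minimal sketch, not the authors' pipeline. It assumes the same reads from one sample have been classified against two hypothetical databases (`db_a` and `db_b`, with invented counts and taxa), and computes the Shannon index as an alpha-diversity measure and the Bray-Curtis dissimilarity between the two resulting profiles. The function names are illustrative only.

```python
import math


def relative_abundance(counts):
    """Convert raw read counts per taxon into relative abundances."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def shannon(counts):
    """Shannon index (natural log) of a taxon-count profile."""
    abundances = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in abundances if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two profiles, on relative abundances."""
    pa = relative_abundance(counts_a)
    pb = relative_abundance(counts_b)
    taxa = set(pa) | set(pb)
    numerator = sum(abs(pa.get(t, 0.0) - pb.get(t, 0.0)) for t in taxa)
    denominator = sum(pa.get(t, 0.0) + pb.get(t, 0.0) for t in taxa)
    return numerator / denominator


# Hypothetical genus-level counts for the SAME sample (invented numbers):
# db_a resolves three genera, db_b leaves part of the reads unclassified.
db_a = {"Lactobacillus": 60, "Gardnerella": 30, "Prevotella": 10}
db_b = {"Lactobacillus": 60, "Unclassified": 40}

print("Shannon (db_a):", shannon(db_a))
print("Shannon (db_b):", shannon(db_b))
print("Bray-Curtis (db_a vs db_b):", bray_curtis(db_a, db_b))
```

In this toy case the sample itself is unchanged; only the taxonomic labels differ, yet both the alpha-diversity value and the apparent composition move, which is the kind of database-dependent effect the abstract describes.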
Computer Science, Biology, Environmental Science