DAIRYdb: A manually curated gold standard reference database for improved taxonomy annotation of 16S rRNA gene sequences from dairy products

Marco Meola, Etienne Rifa, Noam Shani, Céline Delbès, Hélène Berthoud, Christophe Chassard
DOI: https://doi.org/10.1101/386151
2018-08-09
Abstract: Assigning reads to taxonomic units is a key step in microbiome analysis pipelines. To date, accurate taxonomy annotation, particularly at species rank, remains challenging because read sequences are short and classification databases are curated differently. However, the close phylogenetic relationship between species encountered in dairy products requires accurate species annotation to achieve sufficient phylogenetic resolution for downstream ecological studies or for food diagnostics. Taxonomy annotation in universal 16S databases that include environmental sequences, such as Silva, RDP or Greengenes, is based on predictions rather than on studies of type strains or isolates. We provide a manually curated database of 10,290 full-length 16S rRNA gene sequences from prokaryotes, tailored for the analysis of dairy products (https://github.com/marcomeola/DAIRYdb). The performance of the DAIRYdb was compared with that of the universal databases Silva, LTP, RDP and Greengenes. The DAIRYdb significantly outperformed all other databases, independently of the classification algorithm, by enabling more accurate taxonomy annotation down to the species rank. The DAIRYdb automatically and accurately annotates over 90% of the sequences from either single or paired hypervariable regions. The manually curated DAIRYdb strongly improves taxonomic classification accuracy for microbiome studies in dairy environments. The DAIRYdb is a practical solution that enables automation of this key step, thus facilitating the routine application of NGS microbiome analyses for microbial ecology studies and diagnostics in dairy products.
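To make the notion of annotation accuracy "down to the species rank" concrete, the following is a minimal sketch of one way such a figure could be computed. It is not the authors' evaluation procedure. The function name `rank_accuracy`, the rank list, and the two toy lineages are assumptions made purely for illustration; a prediction is counted as correct at a rank only if the whole lineage agrees from domain down to that rank.

```python
# Hypothetical illustration: per-rank accuracy of taxonomy annotation.
# The rank names, function name, and toy data are assumptions, not DAIRYdb code.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def rank_accuracy(expected, predicted):
    """Return, for each rank, the fraction of sequences whose predicted
    lineage matches the expected lineage from domain down to that rank."""
    n = len(expected)
    if n == 0:
        return {}
    correct = {rank: 0 for rank in RANKS}
    for seq_id, true_lineage in expected.items():
        # A sequence with no prediction counts as wrong at every rank.
        pred_lineage = predicted.get(seq_id, [])
        for depth, rank in enumerate(RANKS, start=1):
            if len(true_lineage) >= depth and pred_lineage[:depth] == true_lineage[:depth]:
                correct[rank] += 1
    return {rank: correct[rank] / n for rank in RANKS}


# Toy data (invented): two reads from closely related dairy streptococci.
prefix = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
          "Streptococcaceae", "Streptococcus"]
expected = {
    "read1": prefix + ["Streptococcus thermophilus"],
    "read2": prefix + ["Streptococcus salivarius"],
}
predicted = {
    "read1": prefix + ["Streptococcus thermophilus"],
    "read2": prefix + ["Streptococcus thermophilus"],  # wrong at species rank only
}

accuracy = rank_accuracy(expected, predicted)
print(accuracy["genus"], accuracy["species"])
```

In this toy case both reads are correct at genus rank but only one at species rank, which is the kind of gap between closely related species that the abstract describes.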