Addressing the dynamic nature of reference data: a new nt database for robust metagenomic classification
Jose Manuel Marti, Car Reen Kok, James B Thissen, Nisha J Mulakken, Aram Avila-Herrera, Crystal J Jaing, Jonathan E Allen, Nicholas A Be
DOI: https://doi.org/10.1101/2024.06.12.598617
2024-07-11
Abstract: Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. The NCBI BLAST Nucleotide (nt) database, which encompasses a vast collection of sequences from all domains of life, is an invaluable resource, but its massive size (currently exceeding 10^12 nucleotides) and exponential growth pose significant challenges for researchers seeking to maintain current nt-based indices for metagenomic classification. Recognizing that no current nt-based index exists for the widely used Centrifuge classifier, the last public version having been released in 2018, we addressed this critical gap.
We present a new Centrifuge-compatible nt database, meticulously constructed using a novel pipeline incorporating different quality control measures, including reference decontamination and filtering. These measures demonstrably reduce spurious classifications, and through temporal comparisons, we reveal how this approach minimizes inconsistencies in taxonomic assignments stemming from asynchronous updates between public sequence and taxonomy databases. These discrepancies are particularly evident in taxa such as and , where classification accuracy varied significantly across database versions.
This new database, made available as a pre-built Centrifuge index, responds to the need for an open, robust, nt-based pipeline for taxonomic classification in metagenomics. Applications such as environmental metagenomics, forensics, and clinical metagenomics require comprehensive taxonomic coverage and will benefit from this resource. Our new nt-based index highlights the importance of treating reference databases as dynamic entities, subject to ongoing quality control and validation akin to software development best practices. This dynamic update approach is crucial for ensuring the accuracy and reliability of metagenomic analysis, especially as databases continue to expand in size and complexity.
Bioinformatics