Addressing the dynamic nature of reference data: a new nt database for robust metagenomic classification

Jose Manuel Marti, Car Reen Kok, James B Thissen, Nisha J Mulakken, Aram Avila-Herrera, Crystal J Jaing, Jonathan E Allen, Nicholas A Be
DOI: https://doi.org/10.1101/2024.06.12.598617
2024-07-11
Abstract: Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size (currently exceeding \(10^{12}\) nucleotides) and exponential growth pose significant challenges for researchers seeking to maintain current nt-based indices for metagenomic classification. Recognizing that no current nt-based index exists for the widely used Centrifuge classifier, the last public version having been released in 2018, we addressed this critical gap. We present a new Centrifuge-compatible nt database, constructed using a novel pipeline that incorporates several quality control measures, including reference decontamination and filtering. These measures demonstrably reduce spurious classifications, and through temporal comparisons, we reveal how this approach minimizes inconsistencies in taxonomic assignments stemming from asynchronous updates between public sequence and taxonomy databases. These discrepancies are particularly evident in taxa such as *Listeria monocytogenes* and *Naegleria fowleri*, where classification accuracy varied significantly across database versions. This new database, made available as a pre-built Centrifuge index, responds to the need for an open, robust, nt-based pipeline for taxonomic classification in metagenomics. Applications such as environmental metagenomics, forensics, and clinical metagenomics require comprehensive taxonomic coverage and will benefit from this resource. Our new nt-based index highlights the importance of treating reference databases as dynamic entities, subject to ongoing quality control and validation akin to software development best practices. This dynamic update approach is crucial for ensuring the accuracy and reliability of metagenomic analysis, especially as databases continue to expand in size and complexity.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses three main problems:

1. **Challenges of dynamic reference data**: Accurate metagenomic classification depends on comprehensive, up-to-date, and validated reference databases. The NCBI BLAST Nucleotide (nt) database contains sequences from all domains of life, but its size (currently more than \(10^{12}\) nucleotides) and exponential growth make it difficult for researchers to maintain current nt-based indices.
2. **Lack of a compatible classifier index**: No current nt-based index exists for the widely used Centrifuge classifier; the last public version was released in 2018.
3. **Classification inconsistency caused by asynchronous updates**: Public sequence and taxonomy databases are updated asynchronously, which leads to inconsistent classification results. The effect is most visible in specific taxa, such as *Listeria monocytogenes* and *Naegleria fowleri*, where classification accuracy varies significantly between database versions.

To address these problems, the authors present a new Centrifuge-compatible nt database built with a new pipeline that includes several quality control measures, among them reference decontamination and filtering. These measures reduce spurious classifications, and temporal comparisons show that the approach minimizes the inconsistencies in taxonomic assignments caused by asynchronous updates between public sequence and taxonomy databases. The database is made available as a pre-built Centrifuge index.

The authors also emphasize that reference databases should be treated as dynamic entities that require ongoing quality control and validation, similar to best practices in software development. This dynamic update approach is crucial for ensuring the accuracy and reliability of metagenomic analysis, especially as databases continue to expand in size and complexity.
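To make the asynchronous-update problem concrete, the sketch below shows one simple consistency check between a sequence-to-taxon mapping and a taxonomy snapshot taken at a different time. It is not the authors' pipeline, which is not described in this summary. It only assumes the NCBI taxonomy dump layout (fields separated by tab, pipe, tab) for `nodes.dmp` and `merged.dmp`; the function names, accessions, and taxon IDs are hypothetical toy values chosen for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def parse_dmp_line(line: str) -> List[str]:
    """Split one line of an NCBI taxonomy dump file into its fields.

    Fields are separated by tab-pipe-tab and each line ends with tab-pipe.
    """
    return [field.strip() for field in line.rstrip("\n").rstrip("|").split("|")]


def load_known_taxids(nodes_lines: Iterable[str]) -> Set[int]:
    """Collect every taxon ID present in a nodes.dmp-style snapshot."""
    return {int(parse_dmp_line(line)[0]) for line in nodes_lines if line.strip()}


def load_merged(merged_lines: Iterable[str]) -> Dict[int, int]:
    """Map retired taxon IDs to their replacements (merged.dmp-style)."""
    merged: Dict[int, int] = {}
    for line in merged_lines:
        if line.strip():
            old_id, new_id = parse_dmp_line(line)[:2]
            merged[int(old_id)] = int(new_id)
    return merged


def reconcile(
    acc2taxid: Dict[str, int], known: Set[int], merged: Dict[int, int]
) -> Tuple[Dict[str, int], List[str]]:
    """Return (usable mapping, orphan accessions).

    A taxon ID is kept if the taxonomy snapshot knows it, remapped if the
    snapshot records it as merged into a known ID, and otherwise the
    accession is reported as an orphan that needs manual review.
    """
    resolved: Dict[str, int] = {}
    orphans: List[str] = []
    for accession, taxid in acc2taxid.items():
        if taxid in known:
            resolved[accession] = taxid
        elif merged.get(taxid) in known:
            resolved[accession] = merged[taxid]
        else:
            orphans.append(accession)
    return resolved, orphans


# Toy data (hypothetical IDs, not real NCBI records).
nodes = ["1\t|\t1\t|\tno rank\t|\n", "20\t|\t1\t|\tspecies\t|\n"]
merged_dump = ["99\t|\t20\t|\n"]
acc2taxid = {"ACC_A": 20, "ACC_B": 99, "ACC_C": 12345}

resolved, orphans = reconcile(
    acc2taxid, load_known_taxids(nodes), load_merged(merged_dump)
)
print(resolved)
print(orphans)
```

In this toy case `ACC_B` points at a retired taxon ID that the snapshot has merged into another, while `ACC_C` points at an ID the snapshot does not know at all. Sequences in the second group are the kind that can end up unclassified or misassigned when the sequence and taxonomy sources are out of step.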