MetaKSSD: Boosting the Scalability of Reference Taxonomic Marker Database and the Performance of Metagenomic Profiling Using Sketch Operations

Huiguang Yi
DOI: https://doi.org/10.1101/2024.06.21.600011
2024-07-03
Abstract: The rapid increase in genomes and metagenomic data presents major scalability and efficiency challenges for current metagenomic profilers. In response, we introduce MetaKSSD, which redefines reference taxonomic marker database (MarkerDB) construction and metagenomic profiling using sketch operations, offering efficiency improvements by orders of magnitude. MetaKSSD encompasses 85,202 species in its MarkerDB using just 0.17GB of storage and profiles 10GB of data within seconds, utilizing only 0.5GB of memory. Extensive benchmarking experiments demonstrated that MetaKSSD is among the top-performing profilers across various metrics. In a microbiome-phenotype association study, MetaKSSD identified significantly more effective associations than MetaPhlAn4. We profiled 382,016 metagenomic runs using MetaKSSD, conducted extensive sample clustering analyses, and suggested potential yet-to-be-discovered niches. Additionally, we developed functionality in MetaKSSD for instantaneous searching among large-scale profiles. The client-server architecture of MetaKSSD allows the swift transmission of metagenome sketches over the network and enables real-time online metagenomic analysis, facilitating use by non-expert users.
Bioinformatics