PARMIK: PArtial Read Matching with Inexpensive K-mers

Morteza Baradaran, Ryan M Layer, Kevin Skadron
DOI: https://doi.org/10.1101/2024.10.14.618242
2024-10-17
Abstract: Environmental metagenomic sampling is instrumental in preparing for future pandemics because it enables early identification of potential pathogens and timely intervention strategies. Novel pathogens are a major concern, especially in zoonotic events. However, discovering novel pathogens often requires genome assembly, which remains a significant bottleneck. Robust metagenomic sampling that is directly searchable with new infection samples would give us a real-time understanding of outbreak origins and dynamics. In this study, we propose PArtial Read Matching with Inexpensive K-mers (PARMIK), a search tool that efficiently identifies sequences in a metagenomic sample (reads) that are similar to sequences from a patient sample (queries). For example, at 90% identity between a query and a read, PARMIK surpassed BLAST, providing up to 21% higher recall. By filtering highly frequent k-mers, we reduced PARMIK's index size by over 50%. Moreover, PARMIK identified longer alignments faster than BLAST, with a peak speedup of 1.57x when parallelized across 32 cores.
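The idea of filtering highly frequent k-mers out of an index can be illustrated with a minimal sketch. This is not PARMIK's implementation: the function names `build_kmer_index` and `candidate_reads`, the `max_freq` threshold, and the use of a plain Python dictionary are all assumptions made for illustration. The sketch maps each k-mer to the reads that contain it, drops k-mers that occur in more than `max_freq` reads, and then counts shared k-mers between a query and each read.

```python
from collections import defaultdict


def build_kmer_index(reads, k, max_freq):
    """Map each k-mer to the set of read IDs containing it,
    then drop k-mers that appear in more than max_freq reads.

    max_freq is a hypothetical threshold, not a PARMIK parameter.
    """
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    # Frequent k-mers carry little discriminating power and inflate the index.
    return {kmer: ids for kmer, ids in index.items() if len(ids) <= max_freq}


def candidate_reads(index, query, k):
    """Count, for each read, how many of the query's k-mer positions hit it."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for read_id in index.get(query[i:i + k], ()):
            hits[read_id] += 1
    return dict(hits)


# Toy data, invented for this example.
reads = ["ACGTACGT", "ACGTTTTT", "GGGGACGA"]
index = build_kmer_index(reads, k=4, max_freq=1)
print(candidate_reads(index, "CGTACG", k=4))
```

In this toy input the k-mer `ACGT` occurs in two reads, so a threshold of `max_freq=1` removes it from the index, while the remaining k-mers still identify read 0 as the candidate for the query. A real tool would follow this candidate step with an alignment step to verify identity.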
Genomics
What problem does this paper attempt to address?