MultiStageSearch: a multi-step proteogenomic workflow for taxonomic identification of viral proteome samples adressing database bias

Julian Pipart,Tanja Holstein,Lennart Martens,Thilo Muth
DOI: https://doi.org/10.1101/2024.05.15.594287
2024-05-20
Abstract:The recent years, with the global SARS-Cov-2 pandemic, have shown the importance of strain level identification of viral pathogens. While the gold-standard approach for unknown viral sample identification remains genomics, studies have shown the necessity and advantages of orthogonal experimental approaches such as proteomics, based on proteomic database search methods. The databases required as references for both proteins and genome sequences are known to be biased towards certain taxa, such as pathogenic strains or species, or common model organisms. Additionally, the proteomic databases are not as comprehensive as the genomic databases. We present MultiStageSearch, an iterative database search approach for the taxonomic identification of viral samples combining proteomic and genomic databases. The potentially present species and strains are inferred using a generalist proteomic reference database. MultiStageSearch then automatically creates a proteogenomic database. This database is further pre-processed by filtering for duplicates as well as clustering of identical ORFs to address potential bias present in the genomic database. Furthermore, the workflow is independent of the strain level NCBI taxonomy, enabling the inference of strains that are not present in the NCBI taxonomy. We performed a benchmark on several viral samples to demonstrate the performance of the strain level taxonomic inference. The benchmark shows superior performance com- pared to state of the art methods for untargeted strain level inference using proteomic data while being independent of the NCBI taxonomy at strain level.
Bioinformatics
What problem does this paper attempt to address?