ABSTRACT Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called “rulesets.” Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, P adj ≥ 0.05]. Each contained VirSorter2, and five used our “tuning removal” rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%–46%) than in cellular metagenomes (7%–19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets.
IMPORTANCE The identification of viruses from environmental metagenomes using informatics tools has offered critical insights in microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies are increasingly combining the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should only be combined cautiously. Two- to four-tool combinations maximized viral recovery and minimized non-viral contamination compared with either the single-tool or the five- to six-tool combinations. By providing a rigorous overview of the behavior of in silico viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
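As a rough illustration of the accuracy metric reported above, the following minimal Python sketch computes the Matthews Correlation Coefficient for a ruleset's viral calls against known labels. It is not the authors' pipeline. The contig names, the tool call sets, and the `apply_ruleset` helper are hypothetical, and modeling a ruleset as a union of tool calls minus a removal set is an assumption made only for this example. The toy data are not meant to reproduce the benchmark's findings.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def confusion_counts(truth, called_viral):
    """Count TP, TN, FP, FN given true labels and a set of contigs called viral."""
    tp = sum(1 for c, is_viral in truth.items() if is_viral and c in called_viral)
    fn = sum(1 for c, is_viral in truth.items() if is_viral and c not in called_viral)
    fp = sum(1 for c, is_viral in truth.items() if not is_viral and c in called_viral)
    tn = sum(1 for c, is_viral in truth.items() if not is_viral and c not in called_viral)
    return tp, tn, fp, fn


def apply_ruleset(tool_calls, removal=frozenset()):
    """Hypothetical ruleset: union of each tool's viral calls, minus removed contigs."""
    combined = set().union(*tool_calls)
    return combined - set(removal)


# Toy mock metagenome: contig name -> True if the contig is truly viral.
truth = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": True, "c6": False}
tool_a = {"c1", "c2", "c3"}   # hypothetical calls from one tool
tool_b = {"c2", "c4", "c5"}   # hypothetical calls from a second tool
removal = {"c3", "c4"}        # hypothetical contigs flagged as non-viral contamination

for name, calls in [
    ("tool_a only", apply_ruleset([tool_a])),
    ("tool_a + tool_b", apply_ruleset([tool_a, tool_b])),
    ("tool_a + tool_b, with removal", apply_ruleset([tool_a, tool_b], removal)),
]:
    print(name, round(mcc(*confusion_counts(truth, calls)), 3))
```

MCC uses all four confusion-matrix cells, so a ruleset that gains true positives only by admitting many false positives is penalized. That makes it a reasonable single score for comparing rulesets on mock metagenomes where viral and non-viral sequences are unevenly represented.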

MultiStageSearch: a multi-step proteogenomic workflow for taxonomic identification of viral proteome samples adressing database bias

Microseek: A Protein-Based Metagenomic Pipeline for Virus Diagnostic and Discovery

vPro-MS enables identification of human-pathogenic viruses from patient samples by untargeted proteomics

Refining SARS-CoV-2 Intra-host Variation by Leveraging Large-scale Sequencing Data

Unveiling Inter- and Intra-Patient Sequence Variability with a Multi-Sample Coronavirus Target Enrichment Approach

Scvi-Tools: a Library for Deep Probabilistic Analysis of Single-Cell Omics Data

Targeted Virome Sequencing Enhances Unbiased Detection and Genome Assembly of Known and Emerging Viruses—The Example of SARS-CoV-2

Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods

AliMarko: A Novel Tool for Eukaryotic Virus Identification Using Expert-Guided Approach

Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data

VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

Capturing sequence diversity in metagenomes with comprehensive and scalable probe design

Validation of an unbiased metagenomic detection assay for RNA viruses in viral transport media and plasma

Petabase-scale sequence alignment catalyses viral discovery

Comparison of the performance of two targeted metagenomic virus capture probe-based methods using reference control materials and clinical samples

Multi-amplicon microbiome data analysis pipelines for mixed orientation sequences using QIIME2: Assessing reference database, variable region and pre-processing bias in classification of mock bacterial community samples

Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection

De-heterogeneity of the eukaryotic viral reference database (EVRD) improves the accuracy and efficiency of viromic analysis

A multi-amplicon 16S rRNA sequencing and analysis method for improved taxonomic profiling of bacterial communities

High-sensitivity whole-genome recovery of single viral species in environmental samples

LABRADOR—A Computational Workflow for Virus Detection in High-Throughput Sequencing Data