V- and VL-Scores Uncover Viral Signatures and Origins of Protein Families
Kun Zhou, James C Kosmopoulos, Etan Dieppa Colon, Peter John Badciong, Karthik Anantharaman
DOI: https://doi.org/10.1101/2024.10.24.619987
2024-10-29
Abstract: Viruses are key drivers of microbial diversity, nutrient cycling, and co-evolution in ecosystems, yet their study is hindered by the difficulty of culturing them. Traditional gene-centric methods, which focus on a few hallmark genes such as those encoding capsids, miss much of the viral genome, leaving key viral proteins and functions undiscovered. Here, we introduce two powerful annotation-free metrics, V-score and VL-score, which quantify the virus-likeness of protein families and genomes, and we create an open-access searchable database, V-Score-Search. By applying V- and VL-scores to public databases (KEGG, Pfam, and eggNOG), we link 38–77% of protein families with viruses, a 9–16x increase over current estimates. These metrics outperform existing approaches, enabling precise detection of viral genomes, prophages, and host-derived auxiliary viral genes (AVGs) from fragmented sequences, and significantly improving genome binning. Remarkably, we identify up to 17x more AVGs, dominated by non-metabolic proteins of unknown function. This innovation unlocks new insights into virus signatures and host interactions, with wide-ranging implications from genomics to biotechnology.
Bioinformatics