VirusPredictor: XGBoost-based software to predict virus-related sequences in human data

Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li

DOI: https://doi.org/10.1093/bioinformatics/btae192

IF: 5.8

2024-03-29

Bioinformatics

Abstract:

Motivation: Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data.

Results: We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e. 0.76, 0.93, and 0.98 for 150–350 (Illumina short reads), 850–950 (Sanger sequencing data), and 2000–5000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to >0.98 when query sequences increased from 150–350 to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g. ∼1000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients’ unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions.

Availability and implementation: www.dllab.org/software/VirusPredictor.html.
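The length dependence reported in the abstract can be written down as a small lookup, which may help when deciding whether reads are worth assembling before prediction. This is only an illustrative sketch and not part of VirusPredictor: the function names and the 1000 bp default threshold are assumptions, and the accuracy values are copied from the first-step figures quoted above.

```python
from typing import Optional

# First-step accuracies as quoted in the abstract, keyed by inclusive
# length range in bp. Lengths between the ranges were not reported.
REPORTED_STEP1_ACCURACY = [
    ((150, 350), 0.76),    # Illumina short reads
    ((850, 950), 0.93),    # Sanger sequencing data
    ((2000, 5000), 0.98),  # longer sequences
]


def reported_accuracy(length_bp: int) -> Optional[float]:
    """Return the reported first-step accuracy for this length, if any.

    Hypothetical helper: it only looks up the published ranges and
    returns None for lengths the abstract does not cover.
    """
    for (low, high), accuracy in REPORTED_STEP1_ACCURACY:
        if low <= length_bp <= high:
            return accuracy
    return None


def should_assemble_first(length_bp: int, threshold_bp: int = 1000) -> bool:
    """Flag sequences shorter than the suggested contig length.

    The 1000 bp default mirrors the abstract's "~1000 bp or longer"
    suggestion; it is an assumed cut-off, not a setting of the software.
    """
    return length_bp < threshold_bp
```

For example, `reported_accuracy(250)` looks up the Illumina short-read range, while `reported_accuracy(500)` returns `None` because that length falls between the reported ranges.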

biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the technical challenge of identifying unknown viral sequences in human high-throughput sequencing (HTS) data, especially for viruses without reference genomes. Specifically, it covers the following issues:

1. **Discovery of Unknown Pathogens**: Viruses without reference genomes are difficult to identify through sequence alignment. The authors therefore developed a machine learning method to predict virus-related sequences among patient HTS reads that cannot be aligned to human or known pathogen genomes.
2. **Handling Endogenous Retroviruses (ERVs)**: ERVs constitute a large proportion of the human genome and have high sequence similarity with exogenous infectious viruses, yet existing methods usually do not consider them. According to the authors, this paper is the first to include ERV classification in the prediction of infectious viruses.
3. **Viral Subgroup Classification**: The method not only predicts whether a sequence is from a virus but also classifies predicted infectious viruses into six subgroups (dsDNA, ssDNA, Retro, ssRNA(−), ssRNA(+) and dsRNA). According to the authors, this is the first study to combine viral subgroup prediction with virus prediction.
4. **Prediction Accuracy and Sequence Length**: Prediction accuracy improves as query sequences get longer. For the first-step classification, the accuracies are 0.76, 0.93, and 0.98 for sequences of 150–350 bp (Illumina short reads), 850–950 bp (Sanger sequencing data), and 2000–5000 bp, respectively.

The resulting software tool, VirusPredictor, predicts viral sequences from patient HTS data quickly and accurately, particularly for sequences that cannot be aligned to existing reference genomes. It handles query sequences of different lengths, including assembled contigs and short reads.
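The two-step scheme described above can be sketched as a simple cascade: a first classifier assigns one of three groups, and only sequences labelled as infectious virus are passed to a second classifier for subgroup assignment. The sketch below is a minimal illustration under stated assumptions, not VirusPredictor's actual code: the `Classifier` interface, the label spellings (with an ASCII minus in `ssRNA(-)`), and the function name are hypothetical, and any callable (for example a wrapper around a trained XGBoost model) could stand in for `step1` and `step2`.

```python
from typing import Callable, Optional, Tuple

# Group and subgroup names follow the text; the exact strings are illustrative.
STEP1_GROUPS = ("infectious virus", "ERV", "non-ERV human")
STEP2_SUBGROUPS = ("dsDNA", "ssDNA", "Retro", "ssRNA(-)", "ssRNA(+)", "dsRNA")

# Hypothetical interface: a classifier maps a nucleotide sequence to a label.
Classifier = Callable[[str], str]


def two_step_predict(
    seq: str, step1: Classifier, step2: Classifier
) -> Tuple[str, Optional[str]]:
    """Run the first-step classifier, then the second only for viruses.

    Returns (group, subgroup); subgroup is None unless the first step
    predicts "infectious virus".
    """
    group = step1(seq)
    if group not in STEP1_GROUPS:
        raise ValueError(f"unexpected first-step label: {group!r}")
    if group != "infectious virus":
        # ERV and non-ERV human sequences are not sub-classified.
        return group, None
    subgroup = step2(seq)
    if subgroup not in STEP2_SUBGROUPS:
        raise ValueError(f"unexpected second-step label: {subgroup!r}")
    return group, subgroup
```

With stand-in classifiers, `two_step_predict("ACGT", lambda s: "ERV", lambda s: "dsDNA")` returns `("ERV", None)`, since the second step is skipped for non-viral groups.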

Identifying viruses from metagenomic data by deep learning

DeepViral: prediction of novel virus–host interactions from protein sequences and infectious disease phenotypes

Interpretable detection of novel human viruses from genome sequencing data

Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics

De-heterogeneity of the eukaryotic viral reference database (EVRD) improves the accuracy and efficiency of viromic analysis

VIRALpre: Genomic Foundation Model Embedding Fused with K-mer Feature for Virus Identification

Virus-PLoc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells.

VirBot: an RNA Viral Contig Detector for Metagenomic Data.

VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

Virus Database and Online Inquiry System Based on Natural Vectors

Prediction of virus-host infectious association by supervised learning methods

Prediction of Human-Virus Protein-Protein Interactions Through a Sequence Embedding-Based Machine Learning Method

Vgas: A Viral Genome Annotation System.

Microseek: A Protein-Based Metagenomic Pipeline for Virus Diagnostic and Discovery

Prediction of Virus-Receptor Interactions Based on Similarity and Matrix Completion

VirusImmu: a novel ensemble machine learning approach for viral immunogenicity prediction

Virus-host interactions predictor (VHIP): Machine learning approach to resolve microbial virus-host interaction networks

Prediction of cross-species infection propensities of viruses with receptor similarity

Viral Immunogenicity Prediction by Machine Learning Methods

Evolution-guided Large Language Model is a Predictor of Virus Mutation Trends