Abstract: Viruses, submicroscopic infectious agents, influence all life forms. Identifying viral sequences is essential for understanding their biological functions and for analyzing their impact on public health and on the development of microbial communities. Given this significance, many tools have been developed based on various mathematical methods and algorithms. However, previous methods struggle to identify viral sequences accurately, especially short contigs, because such contigs carry limited information and the available training datasets are small-scale and closed-set. Here we propose VIRALpre, a hybrid framework that combines genomic foundation model (GFM) embeddings with k-mer features of sequences to precisely recognize viral genomic fragments. VIRALpre draws on the generalization capability of GFMs, which have proven their strength in various downstream tasks thanks to newly established large-scale training databases and the attention mechanism. K-mer features, in turn, provide additional biological information that compensates for the limitations of GFMs in classification tasks. Comprehensive experiments demonstrate that VIRALpre significantly outperforms all previous methods on virus identification, by 4% in accuracy. To show that the model remains reliable on contigs dissimilar to the training data, we run a BLASTn-based similarity cut-off test (e-value threshold of 10^-5), on which it achieves about a 10% improvement in F1-score. Beyond these curated test datasets, we conduct new zero-shot cross-dataset tests on benchmark datasets sampled from natural environments, in which VIRALpre identifies most viral sequences while keeping a very low false positive rate. Based on these experiments, VIRALpre can handle short-contig virus identification by learning the distinguishing characteristics of viral sequences, and we hope it can serve as an aid to virus-related research.
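To make the idea of fusing k-mer features with a GFM embedding concrete, here is a minimal sketch in Python. It is illustrative only: the abstract does not state the value of k, the GFM used, or how the two feature sets are combined, so the choice of k, the normalized-frequency encoding, the plain concatenation, and the names `kmer_frequencies` and `fuse_features` are all assumptions. The embedding is a zero-filled placeholder, not the output of a real model.

```python
from itertools import product


def kmer_frequencies(seq: str, k: int = 4) -> list[float]:
    """Return the normalized k-mer frequency vector of seq (length 4**k).

    K-mers containing symbols other than A, C, G, T (for example N)
    are skipped. The value of k is an assumption, not taken from the paper.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def fuse_features(embedding: list[float], kmer_vec: list[float]) -> list[float]:
    """Concatenate an embedding with a k-mer vector.

    Plain concatenation is an assumed fusion strategy for illustration.
    """
    return list(embedding) + list(kmer_vec)


contig = "ACGTACGTNNACGT"
kmer_vec = kmer_frequencies(contig, k=2)
placeholder_embedding = [0.0] * 8  # hypothetical stand-in for a GFM embedding
fused = fuse_features(placeholder_embedding, kmer_vec)
```

The fused vector would then be passed to a downstream classifier. In practice the embedding dimension is set by the chosen GFM, and the two parts may need rescaling so that neither dominates.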

Viral Sequence Identification in Metagenomes using Natural Language Processing Techniques

DeePhage: Distinguishing Virulent and Temperate Phage-Derived Sequences in Metavirome Data with a Deep Learning Approach

AliMarko: A Novel Tool for Eukaryotic Virus Identification Using Expert-Guided Approach

Microbial and Viral Ecology Analysis for Metagenomic Data

Accurate identification of bacteriophages from metagenomic data using Transformer

Part I Examples of Natural and Nature-Inspired Materials

Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models

Associations between age, body size and nephron number with individual glomerular volumes in urban West African males.

[New aspects of chemical urinary calculi analysis. 3].

Identifying viruses from metagenomic data by deep learning

The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data

VIRALpre: Genomic Foundation Model Embedding Fused with K-mer Feature for Virus Identification

Synthesis of azasugars as potent inhibitors of glycosidases.

NLP-based classification of software tools for metagenomics sequencing data analysis into EDAM semantic annotation

CAPTVRED: an automated pipeline for viral tracking and discovery from capture-based metagenomics samples

Bioinformatic Tools for NGS-Based Metagenomics to Improve the Clinical Diagnosis of Emerging, Re-Emerging and New Viruses

Machine Learning and Deep Learning Applications in Metagenomic Taxonomy and Functional Annotation

Using Machine Learning and Natural Language Processing for Unveiling Similarities between Microbial Data

ViromeFlowX: a Comprehensive Nextflow-based Automated Workflow for Mining Viral Genomes from Metagenomic Sequencing Data

Benchmarking bioinformatic virus identification tools using real-world metagenomic data across biomes

A computational toolset for rapid identification of SARS-CoV-2, other viruses and microorganisms from sequencing data