VIRALpre: Genomic Foundation Model Embedding Fused with K-mer Feature for Virus Identification

Zanyi Wang, Qinze Yu, Yu Li
DOI: https://doi.org/10.1101/2024.11.12.623150
2024-11-15
Abstract: Viruses, submicroscopic infectious agents, influence all life forms. Identifying viral sequences is essential for understanding their biological functions and for analyzing their impact on public health and on the development of microbial communities. Given this significance, tools have been developed based on various mathematical methods and algorithms. However, previous methods struggle to identify viral sequences accurately, especially short contigs, because of the limited information such contigs carry and the small-scale, closed-set datasets available. Here we propose VIRALpre, a hybrid framework that combines genomic foundation model (GFM) embeddings with K-mer features of sequences to precisely recognize viral genomic fragments. VIRALpre is empowered by the generalization capabilities of GFMs, which have proven their strength in various downstream tasks thanks to newly established large-scale training databases and the attention mechanism. K-mer features, in turn, provide additional biological information to offset the limitations of GFMs in classification tasks. Comprehensive experimental results demonstrate that VIRALpre significantly outperforms all previous methods on virus identification, by 4% in accuracy. To show that the model remains effective on contigs dissimilar to the training data, a BLASTn-based similarity cut-off test (e-value set to 10^-5) was performed, in which it achieves about a 10% F1-score improvement. Beyond these well-built test datasets, new zero-shot cross-dataset tests were conducted on benchmark datasets sampled from natural environments, where VIRALpre identifies most viral sequences while keeping a very low False Positive Rate. Based on these experiments, VIRALpre can handle short-contig virus identification by truly learning the distinguishing characteristics of viral sequences, and it may serve as an aid to virus-related research.
Bioinformatics
What problem does this paper attempt to address?
The problem this paper attempts to solve is **the challenge of accurately identifying viral genome fragments, especially short contigs**. Specifically:

1. **Background problems**:
   - Viruses are submicroscopic infectious agents that affect all forms of life.
   - Accurate identification of viral sequences is crucial for understanding their biological functions and for analyzing their impact on public health and on the development of microbial communities.
   - Previous tools and methods have limitations in identifying viral sequences and perform especially poorly on short fragments, because they rely on limited information and small-scale, closed-set datasets.
2. **Limitations of existing methods**:
   - **Homology tools**: Tools such as MetaPhinder rely on similarity to known sequences in a database and perform poorly on new sequences that are not similar to the reference sequences.
   - **Machine-learning methods**: Methods such as HMMs and random forests rely on long-sequence inputs to model the relationships between nucleotide or protein markers, which limits their performance in short-sequence analysis.
   - **Deep-learning methods**: Methods such as DeepVirFinder and Seeker can capture global and local dependencies, but the limitations of their training datasets leave them with insufficient generalization ability on unique sequences, resulting in a decline in precision.
3. **Proposed method**:
   - The paper proposes VIRALpre, a hybrid framework that combines **genomic foundation model (GFM) embeddings** and **K-mer features**, aiming to accurately identify viral genome fragments.
   - GFMs enhance generalization ability through large-scale pre-training databases and the attention mechanism.
   - K-mer features provide additional biological information, making up for the limitations of GFMs in classification tasks.
4. **Objectives**:
   - Improve the accuracy of viral-sequence identification, especially on short fragments.
   - Verify the performance of VIRALpre on different datasets through comprehensive experiments, and demonstrate its effectiveness and stability when facing unique sequences.

In summary, this paper aims to develop a new method that identifies viral genome fragments more accurately in various situations, especially short fragments, by combining GFM embeddings and K-mer features.
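
To make the idea of fusing K-mer features with a GFM embedding more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the summary does not state the value of k, how the K-mer features are normalized, or how the two feature sets are fused. The function names `kmer_frequencies` and `fuse_features`, the choice of normalized counts, the default `k=4`, the fusion by simple concatenation, and the placeholder embedding values are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    The vector has 4**k entries in lexicographic k-mer order.
    Windows containing ambiguous bases (e.g. N) are skipped.
    The default k=4 is an illustrative assumption, not a value from the paper.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    total = 0
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        # Sequence shorter than k, or made up only of ambiguous bases.
        return [0.0] * len(counts)
    return [c / total for c in counts]


def fuse_features(embedding, kmer_vec):
    """Fuse two feature vectors by concatenation (an assumed fusion strategy)."""
    return list(embedding) + list(kmer_vec)


# Hypothetical usage: the embedding would normally come from a genomic
# foundation model; these three numbers are placeholders, not real output.
contig = "ACGTACGTNACGT"
placeholder_embedding = [0.1, -0.3, 0.7]
fused = fuse_features(placeholder_embedding, kmer_frequencies(contig, k=2))
```

The fused vector would then be passed to a downstream classifier. Concatenation is the simplest way to combine the two views; a real system might instead project, weight, or attend over them before classification.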