XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples

Shorya Consul, John Robertson, Haris Vikalo
2023-08-28
Abstract: It is estimated that approximately 15% of cancers worldwide can be linked to viral infections. The viruses that can cause or increase the risk of cancer include human papillomavirus, hepatitis B and C viruses, Epstein-Barr virus, and human immunodeficiency virus, to name a few. The computational analysis of the massive amounts of tumor DNA data, whose collection is enabled by recent advancements in sequencing technologies, has allowed studies of the potential association between cancers and viral pathogens. However, the high diversity of oncoviral families makes reliable detection of viral DNA difficult and thus renders such analysis challenging. In this paper, we introduce XVir, a data pipeline that relies on a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. In particular, XVir is trained on genomic sequencing reads from viral and human genomes and may be used with tumor sequence information to find evidence of viral DNA in human cancers. Results on semi-experimental data demonstrate that XVir is capable of achieving high detection accuracy, generally outperforming state-of-the-art competing methods while being more compact and less computationally demanding.
Genomics, Machine Learning
What problem does this paper attempt to address?
The main goal of this paper is to propose a method based on the Transformer architecture for reliably identifying viral DNA sequences from cancer samples. Specifically, the research team developed a data processing pipeline called XVir, which utilizes the Transformer deep learning model to identify viral genomes present in human tumor cells. By training the model to process sequencing read data from both viral and human genomes, XVir can analyze tumor sequence information to look for evidence of viral DNA presence.

The paper addresses the following key issues:

1. **Challenge of High Variability**: The rapid evolution of viral genomes and the incompleteness of gene databases make it difficult to reliably detect highly variable viral families.
2. **Limitations of Existing Tools**: Although there are various existing tools for viral DNA detection, their effectiveness varies. For example, some methods use hidden Markov models, k-mer frequencies, or convolutional neural networks (CNNs), but they have limitations in terms of accuracy and efficiency.

Features of XVir include:

- **Efficiency**: Compared to existing advanced methods, XVir maintains high accuracy while having a smaller model size and lower computational requirements.
- **Performance Advantage**: Experimental results show that XVir outperforms or is comparable to state-of-the-art competitors on semi-experimental data, especially in terms of accuracy.
- **Flexibility**: XVir can handle k-mers of different lengths, and its performance improves with increasing k-mer length, although model complexity also increases.
- **Data Requirement**: XVir can effectively utilize a smaller amount of training data, indicating good generalization capability.

In summary, XVir provides an efficient and accurate method for identifying viral DNA in cancer samples, which is crucial for understanding the role of viruses in cancer development.
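
To make the k-mer point under Flexibility more concrete, the following is a minimal Python sketch of how a sequencing read could be split into overlapping k-mers and mapped to integer tokens before being passed to a Transformer. It is an illustration under stated assumptions, not XVir's actual preprocessing: the function names `build_vocab` and `tokenize`, the stride of 1, and the convention of mapping k-mers with ambiguous bases to id 0 are all hypothetical choices made here.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four canonical bases get their own tokens


def build_vocab(k):
    """Map every possible k-mer over ACGT to an integer id (ids start at 1).

    Id 0 is reserved for k-mers containing any other character (e.g. N).
    """
    return {"".join(p): i + 1 for i, p in enumerate(product(ALPHABET, repeat=k))}


def tokenize(read, k, vocab, stride=1):
    """Slide a window of width k over the read and return a list of token ids."""
    read = read.upper()
    return [
        vocab.get(read[i:i + k], 0)  # unknown or ambiguous k-mers map to 0
        for i in range(0, len(read) - k + 1, stride)
    ]


if __name__ == "__main__":
    read = "ACGTNACG"  # toy read with one ambiguous base
    for k in (2, 3, 6):
        vocab = build_vocab(k)
        print(f"k={k}: vocab size={len(vocab)}, tokens={tokenize(read, k, vocab)}")
```

Under this scheme the vocabulary holds 4^k entries (16 for k = 2, 4096 for k = 6), so any per-token embedding table grows exponentially with k. That is one plausible reason model complexity would rise with k-mer length, though the paper's own explanation may differ.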