Identifying viruses from metagenomic data by deep learning

Jie Ren,Kai Song,Chao Deng,Nathan A. Ahlgren,Jed A. Fuhrman,Yi Li,Xiaohui Xie,Fengzhu Sun
DOI: https://doi.org/10.48550/arXiv.1806.07810
IF: 4.31
2018-06-20
Genomics
Abstract:The recent development of metagenomic sequencing makes it possible to sequence microbial genomes including viruses in an environmental sample. Identifying viral sequences from metagenomic data is critical for downstream virus analyses. The existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences. Here we have developed a reference-free and alignment-free machine learning method, DeepVirFinder, for predicting viral sequences in metagenomic data using deep learning techniques. DeepVirFinder was trained based on a large number of viral sequences discovered before May 2015. Evaluated on the sequences after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths. Enlarging the training data by adding millions of purified viral sequences from environmental metavirome samples significantly improves the accuracy for predicting under-represented viruses. Applying DeepVirFinder to real human gut metagenomic samples from patients with colorectal carcinoma (CRC) identified 51,138 viral sequences belonging to 175 bins. Ten bins were associated with the cancer status, indicating their potential use for non-invasive diagnosis of CRC. In summary, DeepVirFinder greatly improved the precision and recall rates of viral identification, and it will significantly accelerate the discovery rate of viruses.
What problem does this paper attempt to address?