De-heterogeneity of the eukaryotic viral reference database (EVRD) improves the accuracy and efficiency of viromic analysis

Junjie Chen, Xiaomin Yan, Yue Sun, Zilin Ren, Guangzhi Yan, Guoshuai Wang, Yuhang Liu, Zihan Zhao, Yang Liu, Changchun Tu, Biao He
DOI: https://doi.org/10.1101/2022.03.03.482774
2022-03-03
Abstract: Contamination is widespread in public virus reference databases and often leads to confusing or even wrong conclusions in applications such as viral disease diagnosis and viromic analysis, highlighting the need for a high-quality database. Here, we report a comprehensive scrutiny and purification of the largest viral sequence collections of GenBank and UniProt through the detection and characterization of heterogeneous sequences (HGSs). A total of 766 nucleotide and 276 amino acid HGSs were identified, with lengths of up to 6,605 bp. They were widely distributed across 39 families, many involving viruses of high public health relevance, such as hepatitis C virus, Crimean-Congo hemorrhagic fever virus and filoviruses. The majority of these HGSs are sequences from a wide range of hosts, including humans, with the rest resulting from vectors, misclassification and laboratory components. However, these HGSs cannot simply be considered exotic contaminants, since some of them result from natural occurrence or artificial engineering of the viruses. Nevertheless, they significantly disturb genomic analysis and were therefore deleted from the database. The database was further augmented with risk and vaccine sequences, finally resulting in a high-quality eukaryotic viral reference database (EVRD). In viromic analysis, EVRD showed higher accuracy and was less time-consuming than other integrated databases because it reduced false positives, without compromising coverage. EVRD is freely accessible and well suited to viral disease diagnosis, taxonomic clustering, viromic analysis and novel virus detection.