SurVirus: a repeat-aware virus integration caller

Ramesh Rajaby, Yi Zhou, Yifan Meng, Xi Zeng, Guoliang Li, Peng Wu, Wing-Kin Sung
DOI: https://doi.org/10.1093/nar/gkaa1237
IF: 14.9
2021-01-14
Nucleic Acids Research
Abstract: A significant portion of human cancers are due to viruses integrating into human genomes. Therefore, accurately predicting virus integrations can help uncover the mechanisms that lead to many devastating diseases. Virus integrations can be called by analysing second generation high-throughput sequencing datasets. Unfortunately, existing methods fail to report a significant portion of integrations, while predicting a large number of false positives. We observe that the inaccuracy is caused by incorrect alignment of reads in repetitive regions. False alignments create false positives, while missing alignments create false negatives. This paper proposes SurVirus, an improved virus integration caller that corrects the alignment of reads which are crucial for the discovery of integrations. We use publicly available datasets to show that existing methods predict hundreds of thousands of false positives; SurVirus, on the other hand, is significantly more precise while it also detects many novel integrations previously missed by other tools, most of which are in repetitive regions. We validate a subset of these novel integrations, and find that the majority are correct. Using SurVirus, we find that HPV and HBV integrations are enriched in LINE and Satellite regions which had been overlooked, as well as discover recurrent HBV and HPV breakpoints in human genome-virus fusion transcripts.
biochemistry & molecular biology
What problem does this paper attempt to address?
This paper addresses the accuracy of detecting virus integrations into the human genome. Many human cancers, such as liver cancer and cervical cancer, are caused by viruses integrating into the host genome, so accurately predicting these integrations can help reveal the mechanisms behind serious diseases. Existing detection methods perform poorly in repetitive regions: they report a large number of false positives and miss many real integration events. The paper proposes a new tool, SurVirus, which aims to improve the precision and sensitivity of virus integration detection by correcting read alignments, especially in the repetitive regions of the genome.

### Main problems

1. **Limitations of existing methods**:
   - Existing virus integration detection methods make many errors in repetitive regions, producing large numbers of false positives and false negatives.
   - These methods handle ambiguously aligned reads poorly, which leads to incorrect calls.
2. **Goals of SurVirus**:
   - Improve the precision and sensitivity of virus integration detection.
   - Improve the reliability of detection in the repetitive regions of the genome in particular.

### Solutions

SurVirus addresses these problems as follows:

1. **Improving the read-alignment algorithm**:
   - SurVirus uses an iterative clustering method to group reads that support the same integration event and to determine the most likely integration position.
   - This approach handles ambiguously aligned reads in repetitive regions more accurately, reducing both false positives and false negatives.
2. **Performance evaluation**:
   - SurVirus is tested on simulated data sets and real data sets (including HCC and cervical cancer data sets).
   - The results show that SurVirus achieves higher sensitivity and precision on known, verified virus integration events, and also detects many new integration events that other methods missed.

### Experimental results

- **Simulated data sets**:
  - Existing methods perform well on data sets with randomly inserted virus fragments, but poorly when insertions fall in repetitive regions.
  - SurVirus shows the highest sensitivity and precision on all data sets.
- **Real data sets**:
  - On HIVID and WGS data sets, SurVirus detects more known, verified virus integration events while producing fewer false positives.
  - For HPV and HBV, SurVirus detects many new integration events, especially in repetitive regions.

### Conclusion

By correcting read alignments, SurVirus significantly improves the precision and sensitivity of virus integration detection, especially in the repetitive regions of the genome. The tool may prove useful in future cancer research and clinical diagnosis.
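
The idea of grouping reads that support the same integration event, mentioned under Solutions, can be illustrated with a small sketch. This is not SurVirus's actual algorithm: it is a simplified, single-pass greedy clustering by host position, and the `Read` record, the `max_dist` and `min_support` parameters, and the use of the median as the reported breakpoint are all assumptions made for illustration.

```python
from collections import namedtuple
from statistics import median

# Hypothetical record: one read with a host-side and a virus-side alignment.
Read = namedtuple("Read", ["name", "chrom", "host_pos", "virus_pos"])


def cluster_reads(reads, max_dist=500):
    """Greedily group reads on the same chromosome whose host position lies
    within max_dist of the previous read in the cluster (illustrative only)."""
    clusters = []
    for read in sorted(reads, key=lambda r: (r.chrom, r.host_pos)):
        last = clusters[-1] if clusters else None
        if (last and last[-1].chrom == read.chrom
                and read.host_pos - last[-1].host_pos <= max_dist):
            last.append(read)
        else:
            clusters.append([read])
    return clusters


def call_integrations(reads, max_dist=500, min_support=2):
    """Report one candidate integration per cluster with enough supporting reads."""
    calls = []
    for cluster in cluster_reads(reads, max_dist):
        if len(cluster) < min_support:
            continue  # too little evidence; treat as noise in this sketch
        calls.append({
            "chrom": cluster[0].chrom,
            # The median is an assumed stand-in for "most likely position".
            "host_breakpoint": int(median(r.host_pos for r in cluster)),
            "virus_breakpoint": int(median(r.virus_pos for r in cluster)),
            "support": len(cluster),
        })
    return calls


reads = [
    Read("r1", "chr1", 1000, 50),
    Read("r2", "chr1", 1100, 60),
    Read("r3", "chr1", 1200, 70),
    Read("r4", "chr2", 5000, 10),  # singleton, filtered by min_support
]
calls = call_integrations(reads)
```

A real caller would also need to decide where ambiguously aligned reads belong before clustering them, which is the part the paper identifies as the source of errors in repetitive regions; this sketch takes the alignments as given.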