DeepCNV: a deep learning approach for authenticating copy number variations

Joseph T Glessner, Xiurui Hou, Cheng Zhong, Jie Zhang, Munir Khan, Fabian Brand, Peter Krawitz, Patrick M A Sleiman, Hakon Hakonarson, Zhi Wei
DOI: https://doi.org/10.1093/bib/bbaa381
IF: 9.5
2021-01-12
Briefings in Bioinformatics
Abstract: Copy number variations (CNVs) are an important class of variations contributing to the pathogenesis of many disease phenotypes. Detecting CNVs from genomic data remains difficult, and the most currently applied methods suffer from an unacceptably high false positive rate. A common practice is to have human experts manually review original CNV calls for filtering false positives before further downstream analysis or experimental validation. Here, we propose DeepCNV, a deep learning-based tool, intended to replace human experts when validating CNV calls, focusing on the calls made by one of the most accurate CNV callers, PennCNV. The sophistication of the deep neural network algorithm is enriched with over 10 000 expert-scored samples that are split into training and testing sets. Variant confidence, especially for CNVs, is a main roadblock impeding the progress of linking CNVs with the disease. We show that DeepCNV adds to the confidence of the CNV calls with an optimal area under the receiver operating characteristic curve of 0.909, exceeding other machine learning methods. The superiority of DeepCNV was also benchmarked and confirmed using an experimental wet-lab validation dataset. We conclude that the improvement obtained by DeepCNV results in significantly fewer false positive results and failures to replicate the CNV association results.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to reduce the high false-positive rate in copy number variation (CNV) detection**.

### Background and Problem Description:

1. **Importance of CNVs**:
   - CNVs are an important class of genomic structural variation and are closely linked to the pathogenesis of multiple complex diseases (such as schizophrenia and osteoporosis).
   - CNV detection has become routine in genetic and cancer research.
2. **Limitations of Current Methods**:
   - The most commonly used CNV detection methods (such as PennCNV and QuantiSNP) have a relatively high false-positive rate.
   - To filter out false positives, human experts usually review the original CNV calls by hand, which is time-consuming and highly subjective.
3. **Specific Problems**:
   - How can manual review be replaced with an automated method that effectively reduces the false-positive rate in CNV detection?
   - How can the accuracy and confidence of CNV calls be improved to better support disease-related research?

### Proposed Solution:

The paper proposes **DeepCNV**, a deep-learning-based tool aimed at the problems above. Its main goals are to:

- Automatically validate the CNV calls generated by tools such as PennCNV.
- Reduce the false-positive rate while maintaining high sensitivity.
- Remove the cumbersome manual review step and improve efficiency.

### Core Innovation Points of DeepCNV:

1. **Combination of Image Data and Metadata**:
   - The LRR (Log R Ratio) and BAF (B Allele Frequency) scatter plots output by PennCNV serve as the image input.
   - The quality-check statistics generated by PennCNV (such as CNV length and number of SNPs) serve as the metadata input.
2. **Deep Neural Network Architecture**:
   - DeepCNV adopts a hybrid deep neural network with two branches:
     - **CNN Branch**: processes the image data and extracts features from the CNV scatter plots.
     - **DNN Branch**: processes the metadata and captures the influence of the statistical information on the final decision.
   - The outputs of the two branches are concatenated and passed to a fully connected layer; a sigmoid activation then produces the classification probability.
3. **Large-Scale Training Data**:
   - More than 10,000 expert-annotated samples are split into training and testing sets, which supports the model's ability to generalize.

### Experimental Verification:

1. **Human-Annotated Dataset**:
   - On an independent human-annotated dataset, DeepCNV reaches an AUC of 0.909, exceeding the other machine-learning methods it was compared against.
   - The improvement is greatest for small CNVs (<5 kb).
2. **WGS Dataset**:
   - On the whole-genome sequencing (WGS) dataset, DeepCNV also outperforms the compared methods, further supporting its applicability.
3. **Grad-CAM Visualization**:
   - Grad-CAM is used to visualize the image regions the CNN attends to, which helps explain the basis of the model's predictions.

### Summary:

DeepCNV addresses the high false-positive rate in CNV detection: it raises the accuracy and confidence of CNV calls and offers a more reliable tool for disease-related genomic research.
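The two-branch data flow described under the core innovation points (image features from a CNN branch, metadata features from a DNN branch, concatenation, then a fully connected layer with a sigmoid) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, kernel count, random untrained weights, and every function name below are assumptions made for illustration, and a real model would be built and trained in a deep learning framework rather than in plain Python.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def conv2d_valid(image, kernel):
    """Single-channel 2D cross-correlation (no padding, stride 1) plus ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[relu(sum(image[i + di][j + dj] * kernel[di][dj]
                      for di in range(kh) for dj in range(kw)))
             for j in range(out_w)]
            for i in range(out_h)]


def global_avg_pool(feature_map):
    """Collapse a 2D feature map to its mean activation."""
    cells = [v for row in feature_map for v in row]
    return sum(cells) / len(cells)


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_params(n_meta, n_kernels=4, n_hidden=3, seed=0):
    """Random, untrained weights; all sizes here are hypothetical."""
    rng = random.Random(seed)
    return {
        "kernels": [random_matrix(rng, 3, 3) for _ in range(n_kernels)],
        "meta_w": random_matrix(rng, n_hidden, n_meta),
        "meta_b": [0.0] * n_hidden,
        "head_w": random_matrix(rng, 1, n_kernels + n_hidden),
        "head_b": [0.0],
    }


def predict(image, metadata, params):
    # CNN branch: one pooled activation per convolution kernel.
    image_features = [global_avg_pool(conv2d_valid(image, k))
                      for k in params["kernels"]]
    # DNN branch: a single hidden layer over the metadata vector.
    meta_features = dense(metadata, params["meta_w"], params["meta_b"], relu)
    # Concatenate both branches, then a fully connected layer with a sigmoid.
    merged = image_features + meta_features
    return dense(merged, params["head_w"], params["head_b"], sigmoid)[0]


# Usage with stand-in inputs (not real PennCNV output).
rng = random.Random(1)
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # mock LRR/BAF plot
metadata = [0.12, 0.45]  # hypothetical scaled CNV length and SNP count
params = init_params(n_meta=len(metadata))
probability = predict(image, metadata, params)
print(f"score for this call: {probability:.3f}")
```

Because the weights are random, the printed score carries no meaning; the sketch only shows how the image and metadata inputs are merged before the final sigmoid turns them into a single probability for each CNV call.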