CNValidatron, automated validation of CNV calls using computer vision

Simone Montalbano, Bragi Walters, Gudbjorn Jonsson, Jesper Gaadin, Thomas Werge, Hreinn Stefansson, Kari Stefansson, Andres Ingason
DOI: https://doi.org/10.1101/2024.09.09.612035
2024-10-22
Abstract: For more than a decade, running PennCNV on SNP array data has been the gold standard for detecting Copy Number Variants (CNVs, deletions and duplications). It is generally assumed that PennCNV has high sensitivity but poor specificity, leading to a large portion of CNV calls being false positives. Researchers often rely on manual inspection of the raw data trends to validate CNV calls. However, this approach is not feasible for more than a handful of loci in large collections. Here we present an R package implementing a convolutional neural network capable of automating CNV validation with an accuracy comparable to a trained human analyst. We also present an in-depth analysis into PennCNV false positive and false negative rates. Finally, we propose an algorithm to simplify the analysis of genome-wide CNV calls computing CNV regions. The code is available on GitHub https://github.com/SinomeM/CNValidatron_fl.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the high false-positive rate of PennCNV when it is used to detect copy number variants (CNVs) in single-nucleotide polymorphism (SNP) array data. Specifically:

1. **High false-positive rate**: PennCNV is generally assumed to have high sensitivity but poor specificity, so a large portion of its CNV calls are false positives. Researchers usually validate calls by manually inspecting the raw data, which is not feasible for large sample collections.
2. **Automated validation**: To reduce the number of false-positive calls, the authors developed CNValidatron, an R package built on a convolutional neural network (CNN) that validates CNV calls automatically with an accuracy comparable to that of a trained human analyst.
3. **Data analysis**: The paper also presents an in-depth analysis of PennCNV's false-positive and false-negative rates, and proposes an algorithm that simplifies the analysis of genome-wide CNV calls by computing CNV regions (CNVRs).

### Main contributions

- **Automated validation tool**: CNValidatron, an R package that uses computer vision to validate CNV calls automatically.
- **Performance evaluation**: The accuracy and precision of CNValidatron are evaluated on different data sets.
- **CNVR algorithm**: A new method that defines and groups CNVRs using the Leiden community detection algorithm from network analysis.

### Problems solved

- **High false-positive rate**: Automated validation filters out false-positive calls, improving the reliability of the final CNV call set.
- **Large-scale data processing**: Automation replaces manual validation, which does not scale to large data sets.
- **Data organization**: Grouping calls into CNVRs makes genome-wide CNV data easier to organize and interpret in downstream analyses.
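The CNVR grouping idea mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: CNVs are treated as half-open `(chromosome, start, end)` intervals, two CNVs are linked when their reciprocal overlap reaches a hypothetical `min_overlap` threshold, and plain connected components stand in for the Leiden community detection step so that the example needs no third-party library. All function names are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between CNV intervals a and b."""
    (chrom_a, start_a, end_a), (chrom_b, start_b, end_b) = a, b
    if chrom_a != chrom_b:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0
    return min(shared / (end_a - start_a), shared / (end_b - start_b))


def build_overlap_graph(cnvs, min_overlap=0.5):
    """Link every pair of CNVs whose reciprocal overlap is >= min_overlap.

    The edge rule and the 0.5 default are assumptions made for this sketch.
    """
    graph = {i: set() for i in range(len(cnvs))}
    for i in range(len(cnvs)):
        for j in range(i + 1, len(cnvs)):
            if reciprocal_overlap(cnvs[i], cnvs[j]) >= min_overlap:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def connected_components(graph):
    """Group nodes by connectivity (a simple stand-in for Leiden communities)."""
    seen = set()
    groups = []
    for node in graph:
        if node in seen:
            continue
        seen.add(node)
        stack = [node]
        group = []
        while stack:
            current = stack.pop()
            group.append(current)
            for neighbour in graph[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


def cnv_regions(cnvs, min_overlap=0.5):
    """Return one (chrom, start, end, member_indices) tuple per group of CNVs."""
    graph = build_overlap_graph(cnvs, min_overlap)
    regions = []
    for group in connected_components(graph):
        chrom = cnvs[group[0]][0]
        start = min(cnvs[i][1] for i in group)
        end = max(cnvs[i][2] for i in group)
        regions.append((chrom, start, end, group))
    return regions


# Toy input: two overlapping calls, one isolated call, one on another chromosome.
EXAMPLE_CNVS = [
    ("chr1", 100, 200),
    ("chr1", 110, 210),
    ("chr1", 500, 600),
    ("chr2", 100, 200),
]
EXAMPLE_REGIONS = cnv_regions(EXAMPLE_CNVS)
```

In a closer approximation of the paper's approach, the `connected_components` step would be replaced by a Leiden partition of the same graph, which can split a chain of loosely linked calls into several regions instead of merging them into one.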