CNV-Finder: Streamlining Copy Number Variation Discovery

Nicole Kuznetsov, Kensuke Daida, Mary B Makarious, Bashayer Al-Mubarak, Kajsa Atterling Brolin, Laksh Malik, Cedric Kouam, Breeana Baker, Miriam Ostrozovicova, Katherine M Andersh, Pin-Jui Kung, Yasser Mecheri, Yi-Wen Tay, Behloul Soundous Malek, Nada Al Tassan, Maria Teresa Perinan, Samantha Hong, Mathew Koretsky, Lana Sargeant, Kristin Levine, Cornelis Blauwendraat, Kimberley J Billingsley, Sara Bandres-Ciga, Hampton L Leonard, Huw R Morris, Andrew B Singleton, Mike A Nalls, Dan Vitale, The Global Parkinson's Genetics Program
DOI: https://doi.org/10.1101/2024.11.22.624040
2024-11-23
Abstract: Copy Number Variations (CNVs) play pivotal roles in the etiology of complex diseases and are variable across diverse populations. Understanding the association between CNVs and disease susceptibility is of significant importance in disease genetics research and often requires analysis of large sample sizes. One of the most cost-effective and scalable methods for detecting CNVs is based on normalized signal intensity values, such as Log R Ratio (LRR) and B Allele Frequency (BAF), from Illumina genotyping arrays. In this study, we present CNV-Finder, a novel pipeline integrating deep learning techniques on array data, specifically a Long Short-Term Memory (LSTM) network, to expedite the large-scale identification of CNVs within predefined genomic regions. This facilitates the efficient prioritization of samples for subsequent, costly analyses such as short-read and long-read whole genome sequencing. We focus on five genes--Parkin (PRKN), Leucine Rich Repeat And Ig Domain Containing 2 (LINGO2), Microtubule Associated Protein Tau (MAPT), alpha-Synuclein (SNCA), and Amyloid Beta Precursor Protein (APP)--which may be relevant to neurological diseases such as Alzheimer's disease (AD), Parkinson's disease (PD), or related disorders such as essential tremor (ET). By training our models on expert-annotated samples and validating them across diverse cohorts, including those from the Global Parkinson's Genetics Program (GP2) and additional dementia-specific databases, we demonstrate the efficacy of CNV-Finder in accurately detecting deletions and duplications. Our pipeline outputs app-compatible files for visualization within CNV-Finder's interactive web application. This interface enables researchers to review predictions and filter displayed samples by model prediction values, LRR range, and variant count in order to explore or confirm results. Our pipeline integrates this human feedback to enhance model performance and reduce false positive rates.
Through a series of comprehensive analyses and validations using both short-read and long-read sequencing data, we demonstrate the robustness and adaptability of CNV-Finder in identifying CNVs with regions of varied sparsity, noise, and size. Our findings highlight the significance of contextual understanding and human expertise in enhancing the precision of CNV identification, particularly in complex genomic regions like 17q21.31. The CNV-Finder pipeline is a scalable, publicly available resource for the scientific community, available on GitHub (https://github.com/GP2code/CNV-Finder; DOI 10.5281/zenodo.14182563). CNV-Finder not only expedites accurate candidate identification but also significantly reduces the manual workload for researchers, enabling future targeted validation and downstream analyses in regions or phenotypes of interest.
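The signal-intensity terms used in the abstract can be made concrete with a short sketch. LRR is conventionally the base-2 logarithm of observed over expected probe intensity, and BAF values near 0.5 mark heterozygous probes. The code below is only an illustration of how those signals shift over a deletion or duplication: the function names, window count, and cutoffs are hypothetical, and the threshold rule is a naive stand-in, not the LSTM model that CNV-Finder uses.

```python
import math
from statistics import mean


def log_r_ratio(observed_r: float, expected_r: float) -> float:
    """LRR: log2 of observed total intensity over the expected intensity."""
    return math.log2(observed_r / expected_r)


def window_means(values, n_windows):
    """Average an ordered list of per-probe values in contiguous chunks.

    Returns up to n_windows means (fewer if there are not enough probes).
    """
    size = math.ceil(len(values) / n_windows)
    return [mean(values[i:i + size]) for i in range(0, len(values), size)]


def het_band_fraction(bafs, lo=0.4, hi=0.6):
    """Fraction of probes whose BAF falls in the heterozygous band near 0.5."""
    return sum(1 for b in bafs if lo <= b <= hi) / len(bafs)


def naive_call(mean_lrr, del_cutoff=-0.2, dup_cutoff=0.2):
    """Illustrative threshold rule; the cutoffs are assumptions, not from the paper."""
    if mean_lrr <= del_cutoff:
        return "deletion-like"
    if mean_lrr >= dup_cutoff:
        return "duplication-like"
    return "neutral"


# Hypothetical per-probe LRR values across a region, ordered by position.
example_lrr = [0.0, 0.02, -0.5, -0.6, 0.01, -0.01]
example_calls = [naive_call(m) for m in window_means(example_lrr, 3)]
print(example_calls)
```

A sequence model such as an LSTM would consume windowed features of this kind in positional order rather than applying fixed cutoffs, which is what lets it tolerate sparse or noisy regions better than a simple threshold.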
Biology
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses how to detect copy number variations (CNVs) in the genome efficiently and accurately, especially CNVs related to neurodegenerative diseases such as Alzheimer's disease and Parkinson's disease. Specifically, it proposes a new tool, CNV-Finder, which applies deep learning (in particular a Long Short-Term Memory network, LSTM) to Illumina genotyping array data to accelerate the large-scale identification of CNVs.

#### Main problems include:

1. **Improving the accuracy of CNV detection**:
   - Traditional CNV detection methods are prone to false-positive and false-negative results when applied to large sample sets. CNV-Finder uses a deep learning model, specifically an LSTM network, to improve the sensitivity and specificity of CNV detection and thereby reduce the false-positive rate.
2. **Efficiency issues in large-scale data processing**:
   - The genomic regions studied are large, and many samples need to be analyzed. CNV-Finder achieves efficient parallel processing by optimizing the data-processing flow and model architecture, significantly reducing the workload of manual verification.
3. **CNV detection in complex genomic regions**:
   - Some regions, such as 17q21.31, have complex structures with high linkage disequilibrium, functional haplotypes, and common inversions, which make CNV detection more challenging. CNV-Finder improves detection accuracy in these regions by combining expert annotations with multiple verification methods.
4. **Applicability across populations**:
   - Differences in genetic background between populations may lead to inconsistent CNV detection results. CNV-Finder supports applicability and reliability in diverse populations by training and validating on data from different populations around the world.
5. **Visualization and interactive feedback**:
   - To help researchers understand and verify the model's predictions, CNV-Finder provides an interactive web application. Users can explore or confirm results by filtering on prediction values, LRR range, and variant count, and this feedback is integrated into the model to further improve performance.

#### Key gene and disease associations:

- **Parkin (PRKN)**: Related to autosomal recessive Parkinson's disease.
- **Leucine Rich Repeat And Ig Domain Containing 2 (LINGO2)**: May be related to essential tremor and Parkinson's disease.
- **Microtubule Associated Protein Tau (MAPT)**: Related to multiple neurodegenerative diseases (such as frontotemporal dementia, dementia with Lewy bodies, etc.).
- **Amyloid Beta Precursor Protein (APP)**: Related to early-onset Alzheimer's disease and cerebral amyloid angiopathy.
- **alpha-Synuclein (SNCA)**: Related to monogenic Parkinson's disease.

By addressing these problems, CNV-Finder improves the accuracy and efficiency of CNV detection and provides a practical tool for research on neurodegenerative diseases.
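The review step described under visualization and interactive feedback (filtering by prediction value, LRR range, and variant count) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Prediction` record, its field names, and the default thresholds are hypothetical and are not taken from the CNV-Finder application or its output files.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical per-sample record; field names are assumptions."""
    sample_id: str
    score: float        # model prediction value, assumed to lie in [0, 1]
    mean_lrr: float     # mean Log R Ratio across the region of interest
    variant_count: int  # number of array variants supporting the call


def filter_predictions(preds, min_score=0.8, lrr_range=(-1.0, 1.0), min_variants=10):
    """Keep samples passing all three filters, highest score first."""
    lo, hi = lrr_range
    kept = [
        p for p in preds
        if p.score >= min_score
        and lo <= p.mean_lrr <= hi
        and p.variant_count >= min_variants
    ]
    return sorted(kept, key=lambda p: p.score, reverse=True)


# Made-up example records, for illustration only.
example_preds = [
    Prediction("S1", 0.95, -0.45, 40),
    Prediction("S2", 0.99, -0.50, 3),   # high score but too few variants
    Prediction("S3", 0.60, 0.30, 55),   # score below the cutoff
    Prediction("S4", 0.85, 0.35, 25),
]
for p in filter_predictions(example_preds):
    print(p.sample_id, p.score)
```

Requiring a minimum variant count alongside the score is one simple way to deprioritize confident-looking calls that rest on very few probes, which is the kind of case a human reviewer would want to inspect before confirming.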