Abstract: Copy Number Variations (CNVs) play pivotal roles in the etiology of complex diseases and vary across diverse populations. Understanding the association between CNVs and disease susceptibility is important in disease genetics research and often requires analysis of large sample sizes. One of the most cost-effective and scalable methods for detecting CNVs is based on normalized signal intensity values, such as Log R Ratio (LRR) and B Allele Frequency (BAF), from Illumina genotyping arrays. In this study, we present CNV-Finder, a novel pipeline that applies deep learning to array data, specifically a Long Short-Term Memory (LSTM) network, to expedite the large-scale identification of CNVs within predefined genomic regions. This facilitates the efficient prioritization of samples for subsequent, costly analyses such as short-read and long-read whole genome sequencing. We focus on five genes: Parkin (PRKN), Leucine Rich Repeat And Ig Domain Containing 2 (LINGO2), Microtubule Associated Protein Tau (MAPT), alpha-Synuclein (SNCA), and Amyloid Beta Precursor Protein (APP). These genes may be relevant to neurological diseases such as Alzheimer's disease (AD), Parkinson's disease (PD), or related disorders such as essential tremor (ET). By training our models on expert-annotated samples and validating them across diverse cohorts, including those from the Global Parkinson's Genetics Program (GP2) and additional dementia-specific databases, we demonstrate the efficacy of CNV-Finder in accurately detecting deletions and duplications. Our pipeline outputs app-compatible files for visualization within CNV-Finder's interactive web application. This interface enables researchers to review predictions and filter displayed samples by model prediction values, LRR range, and variant count in order to explore or confirm results. Our pipeline integrates this human feedback to enhance model performance and reduce false positive rates.
Through a series of comprehensive analyses and validations using both short-read and long-read sequencing data, we demonstrate the robustness and adaptability of CNV-Finder in identifying CNVs in regions of varied sparsity, noise, and size. Our findings highlight the significance of contextual understanding and human expertise in enhancing the precision of CNV identification, particularly in complex genomic regions like 17q21.31. The CNV-Finder pipeline is a scalable resource for the scientific community, publicly available on GitHub (https://github.com/GP2code/CNV-Finder; DOI 10.5281/zenodo.14182563). CNV-Finder not only expedites accurate candidate identification but also significantly reduces the manual workload for researchers, enabling future targeted validation and downstream analyses in regions or phenotypes of interest.
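To make the signal and the review filters mentioned in the abstract concrete, the following is a minimal sketch, not CNV-Finder's actual code. It uses the standard definition of LRR as the base-2 logarithm of observed over expected probe intensity, and then filters per-sample summaries by prediction value, mean LRR range, and variant count. The `SampleSummary` fields, the `filter_samples` helper, and all threshold values are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from statistics import mean


def log_r_ratio(observed_r: float, expected_r: float) -> float:
    """LRR = log2(observed intensity / expected intensity).

    Values below 0 suggest copy loss; values above 0 suggest copy gain.
    """
    return math.log2(observed_r / expected_r)


@dataclass
class SampleSummary:
    """Per-sample summary for one genomic region (hypothetical schema)."""
    sample_id: str
    prediction: float    # model output in [0, 1]
    mean_lrr: float      # mean LRR over probes in the region
    variant_count: int   # number of probes/variants in the region


def filter_samples(samples, min_prediction=0.8,
                   lrr_range=(-2.0, -0.2), min_variants=10):
    """Keep samples that pass all three review filters.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    low, high = lrr_range
    return [
        s for s in samples
        if s.prediction >= min_prediction
        and low <= s.mean_lrr <= high
        and s.variant_count >= min_variants
    ]


# Toy data: expected intensity is fixed at 1.0 for simplicity.
samples = [
    # Strong prediction, clearly negative LRR, many probes.
    SampleSummary("S1", 0.95,
                  mean(log_r_ratio(r, 1.0) for r in [0.5, 0.55, 0.6]), 25),
    # Weak prediction, LRR near zero.
    SampleSummary("S2", 0.30,
                  mean(log_r_ratio(r, 1.0) for r in [1.0, 1.0, 1.0]), 40),
    # Strong prediction but too few probes to trust.
    SampleSummary("S3", 0.90,
                  mean(log_r_ratio(r, 1.0) for r in [0.5, 0.5]), 3),
]

candidates = [s.sample_id for s in filter_samples(samples)]
print(candidates)
```

With this toy data only `S1` passes all three filters. A negative `lrr_range` targets candidate deletions; a positive range would be the analogous choice for candidate duplications.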

Efficient CNV Breakpoint Analysis Reveals Unexpected Structural Complexity and Correlation of Dosage-Sensitive Genes with Clinical Severity in Genomic Disorders

Novel Association Strategy with Copy Number Variation for Identifying New Risk Loci of Human Diseases

Diagnostic and Clinical Utility of Whole Genome Sequencing in a Cohort of Undiagnosed Chinese Families with Rare Diseases

Rearrangement Structure-Independent Strategy of CNV Breakpoint Analysis

Mechanisms for nonrecurrent genomic rearrangements associated with CMT1A or HNPP: rare CNVs as a cause for missing heritability

Detection of Chromosomal Breakpoints in Patients with Developmental Delay and Speech Disorders

Genomic Balancing Act: deciphering DNA rearrangements in the complex chromosomal aberration involving 5p15.2, 2q31.1, and 18q21.32

Long-read sequencing and optical genome mapping identify causative gene disruptions in noncoding sequence in two patients with neurologic disease and known chromosome abnormalities

Genetic and functional characterization of inherited complex chromosomal rearrangements in a family with multisystem anomalies

CNVbase: Batch Identification of Novel and Rare Copy Number Variations Based on Multi-Ethnic Population Data

Concordance of whole-genome long-read sequencing with standard clinical testing for Prader-Willi and Angelman syndromes

CNV-Finder: Streamlining Copy Number Variation Discovery

Comparative Study of Three PCR-Based Copy Number Variant Approaches, CFMSA, M-PCR, and MLPA, in 22q11.2 Deletion Syndrome

Development of coupling controlled polymerizations by adapter-ligation in mate-pair sequencing for detection of various genomic variants in one single assay

CNV-Profile Regression: A New Approach for Copy Number Variant Association Analysis in Whole Genome Sequencing Data

Designing a Simple Multiplex Ligation-Dependent Probe Amplification (MLPA) Assay for Rapid Detection of Copy Number Variants in the Genome

Mate-pair Library Construction with Controlled Polymerization Enables Comprehensive Structural Rearrangement Detection

Genome-wide association study of copy number variations in Parkinson's disease

A MALDI-TOF mass spectrometry-based method for detection of copy number variations in BRCA1 and BRCA2 genes

Genomic Duplication Resulting in Increased Copy Number of Genes Encoding the Sister Chromatid Cohesion Complex Conveys Clinical Consequences Distinct from Cornelia De Lange

An Integrated Approach Including CRISPR/Cas9-Mediated Nanopore Sequencing, Mate Pair Sequencing, and Cytogenomic Methods to Characterize Complex Structural Rearrangements in Acute Myeloid Leukemia