Pangenome graphs improve the analysis of structural variants in rare genetic diseases

Cristian Groza, Carl Schwendinger-Schreck, Warren A. Cheung, Emily G. Farrow, Isabelle Thiffault, Juniper Lake, William B. Rizzo, Gilad Evrony, Tom Curran, Guillaume Bourque, Tomi Pastinen
DOI: https://doi.org/10.1038/s41467-024-44980-2
IF: 16.6
2024-01-22
Nature Communications
Abstract: Rare DNA alterations that cause heritable diseases are only partially resolvable by clinical next-generation sequencing due to the difficulty of detecting structural variation (SV) in all genomic contexts. Long-read, high fidelity genome sequencing (HiFi-GS) detects SVs with increased sensitivity and enables assembling personal and graph genomes. We leverage standard reference genomes, public assemblies (n = 94) and a large collection of HiFi-GS data from a rare disease program (Genomic Answers for Kids, GA4K, n = 574 assemblies) to build a graph genome representing a unified SV callset in GA4K, identify common variation and prioritize SVs that are more likely to cause genetic disease (MAF < 0.01). Using graphs, we obtain a higher level of reproducibility than the standard reference approach. We observe over 200,000 SV alleles unique to GA4K, including nearly 1000 rare variants that impact coding sequence. With improved specificity for rare SVs, we isolate 30 candidate SVs in phenotypically prioritized genes, including known disease SVs. We isolate a novel diagnostic SV in KMT2E, demonstrating use of personal assemblies coupled with pangenome graphs for rare disease genomics. The community may interrogate our pangenome with additional assemblies to discover new SVs within the allele frequency spectrum relevant to genetic diseases.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper addresses the difficulty of detecting structural variants (SVs) in rare genetic diseases. Clinical next-generation sequencing resolves disease-causing DNA alterations only partially, because it cannot detect SVs reliably in all genomic contexts, particularly in repetitive sequence. Long-read, high-fidelity genome sequencing (HiFi-GS) detects SVs with greater sensitivity and makes it possible to assemble personal and graph genomes. However, existing SV detection methods often rely on a single reference genome, which can lead to missed or incorrect calls in some regions.

To address these issues, the authors combine standard reference genomes, public assemblies (n = 94), and HiFi-GS data from a large rare disease program (Genomic Answers for Kids, GA4K, n = 574 assemblies) to build a pangenome graph. The graph represents a unified SV callset for GA4K, which the authors use to identify common variation and to prioritize rare SVs (MAF < 0.01) that are more likely to cause genetic disease.

The main objectives are:

1. **Improving the accuracy of SV detection**: Building a pangenome graph to reduce the errors that come from relying on a single reference genome.
2. **Identifying rare variants**: Prioritizing SVs with very low frequency (MAF < 0.01), which are more likely to be pathogenic.
3. **Enhancing reproducibility**: Obtaining a higher level of reproducibility with the graph-based method than with the standard reference approach.
4. **Discovering new SVs**: Using the pangenome graph to find SVs not seen before, particularly rare variants that affect coding sequence.

Through these efforts, the authors aim to provide more powerful tools and methods for the diagnosis and study of rare genetic diseases.
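The rare-variant prioritization step (MAF < 0.01) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `SVAllele` record, the `prioritize_rare` helper, the SV identifiers, and the haplotype counts are all hypothetical, chosen only to show how a minor allele frequency threshold separates rare from common SV alleles in a callset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVAllele:
    """Hypothetical record for one SV allele in a unified callset."""
    sv_id: str
    allele_count: int      # haplotypes carrying this SV allele
    haplotype_count: int   # haplotypes genotyped at this site


def minor_allele_frequency(allele: SVAllele) -> float:
    """Return the frequency of the less common allele at a biallelic site."""
    if allele.haplotype_count <= 0:
        raise ValueError("haplotype_count must be positive")
    if not 0 <= allele.allele_count <= allele.haplotype_count:
        raise ValueError("allele_count must lie in [0, haplotype_count]")
    frequency = allele.allele_count / allele.haplotype_count
    return min(frequency, 1.0 - frequency)


def prioritize_rare(alleles, maf_threshold=0.01):
    """Keep alleles that are observed, not fixed, and below the MAF threshold."""
    return [
        a for a in alleles
        if 0.0 < minor_allele_frequency(a) < maf_threshold
    ]


# Illustrative counts only; these are not values from the paper.
callset = [
    SVAllele("sv_del_001", allele_count=2, haplotype_count=1000),
    SVAllele("sv_ins_002", allele_count=350, haplotype_count=1000),
    SVAllele("sv_dup_003", allele_count=20, haplotype_count=1000),
]

rare_ids = [a.sv_id for a in prioritize_rare(callset)]
print(rare_ids)  # ['sv_del_001']
```

In this sketch only `sv_del_001` (MAF 0.002) passes the filter; the other two alleles are too common. In practice, the allele counts would come from genotyping samples against the graph, and the threshold would be applied alongside other evidence such as gene-level phenotype prioritization.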