Large complex structural rearrangements in human genomes harbor cryptic structures

Peter A Audano, Carolyn Paisie, The Human Genome Structural Variation Consortium, Christine R Beck
DOI: https://doi.org/10.1101/2024.12.19.629504
2024-12-22
Abstract: Structural variation is a major contributor to human diversity, adaptation, and disease. Simple structural variant (SV) types include deletions, insertions, duplications, inversions, and translocations, and SVs account for most of the variable bases between genomes. Complex structural variants (CSVs) that consist of one or more simple events in cis appear more frequently in diseases and cancers where DNA repair, apoptosis, and cell cycle checkpoints are compromised, although CSVs can also appear in germline genome sequences of healthy individuals. CSVs are often characterized by short tracts of homology or no homology, and while CSVs are more prevalent in complex regions that contain large repeats, smaller stretches of homology can also enable their formation across more unique loci. Long-read assemblies have increased the size of detectable SVs and expanded variant detection into more complex regions of the genome, and while they reconstruct CSVs, methods for identifying CSVs from assemblies are limited. Here, we have developed a new assembly-based approach to trace through complex loci rather than relying upon reference representations of alignments. Using this approach, we can now access CSVs in large complex segmental duplications, reveal structures that were previously unknown, and identify SV breakpoints with greater accuracy. With this approach, we find 72 large CSVs per genome and 128 unique complex structures. CSVs in highly repetitive regions can now be identified, including several distinct complex events in repetitive NBPF genes that were not previously callable with short-read or long-read CSV methods. This approach is implemented within a key assembly-based variant calling tool, PAV, and represents a substantial improvement in identifying complex variants now ascertainable from contiguous genome assemblies.
Genomics