Abstract: Long-read sequencing (LRS) enables high-quality calling of structural variants (SVs). SV genotypers use these precise call sets to increase the recall and precision of genotyping in short-read sequencing (SRS) samples. Given the extensive growth in the availability of SRS datasets in recent years, it should be possible to calculate accurate population allele frequencies of SVs. However, reprocessing hundreds of terabytes of raw SRS data to genotype new variants is impractical for population-scale studies, a computational challenge known as the N+1 problem. Solving this computational bottleneck is necessary to analyze new SVs from the growing number of pangenomes in many species, public genomic databases, and pathogenic variant discovery studies. To address the N+1 problem, we propose The Great Genotyper, a population genotyping workflow. Applied to a human dataset, the workflow begins by preprocessing 4.2K short-read samples, totaling 183 TB of raw data, to create an 867 GB Counting Colored De Bruijn Graph (CCDG). The Great Genotyper uses this CCDG to genotype a list of phased or unphased variants, leveraging the CCDG population information to increase both precision and recall. The Great Genotyper matches the accuracy of state-of-the-art genotypers while delivering unprecedented performance: it took 100 hours to genotype 4.5M variants in the 4.2K samples using one server with 32 cores and 145 GB of memory. A similar task would take months or even years using single-sample genotypers. The Great Genotyper opens the door to new ways to study SVs. We demonstrate its application in finding pathogenic variants by calculating accurate allele frequencies for novel SVs. In addition, a premade index is used to create a 4K reference panel by genotyping variants from the Human Pangenome Reference Consortium (HPRC). The new reference panel allows for SV imputation from genotyping microarrays.
Moreover, we genotype the GWAS catalog and merge its variants with the 4K reference panel. We show 6.2K events of high linkage between the HPRC's SVs and nearby GWAS SNPs, which can help in interpreting the effect of these SVs on gene functions. This analysis uncovers the detailed haplotype structure of the human fibrinogen locus and revives the pathogenic association of a 28 bp insertion in the FGA gene with thromboembolic disorders.
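
To make the idea of genotyping from a precomputed, sample-aware k-mer index concrete, the following is a minimal toy sketch in Python. It is not The Great Genotyper's algorithm or data structure: the function names (`build_index`, `genotype`, `allele_frequency`), the plain dictionary standing in for a CCDG, and the presence/absence genotype rule are all assumptions made for illustration. It only shows why new variants can be genotyped across many samples without rereading the raw data once per-sample k-mer counts are stored.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(samples, k):
    """Toy stand-in for a counting colored index: k-mer -> {sample: count}.

    `samples` maps a sample name to its list of reads (hypothetical input).
    """
    index = defaultdict(Counter)
    for name, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer][name] += 1
    return index


def genotype(index, sample, ref_allele, alt_allele, k):
    """Call a genotype from k-mers unique to each allele (simplified rule)."""
    ref_kmers = set(kmers(ref_allele, k))
    alt_kmers = set(kmers(alt_allele, k))

    def support(allele_specific):
        # .get avoids inserting missing k-mers into the defaultdict.
        return sum(index.get(km, {}).get(sample, 0) for km in allele_specific)

    ref_support = support(ref_kmers - alt_kmers)
    alt_support = support(alt_kmers - ref_kmers)
    if ref_support and alt_support:
        return "0/1"
    if alt_support:
        return "1/1"
    if ref_support:
        return "0/0"
    return "./."  # no informative k-mers observed


def allele_frequency(genotypes):
    """Alternate-allele frequency over called diploid genotypes."""
    called = [g for g in genotypes.values() if g != "./."]
    if not called:
        return None
    return sum(g.count("1") for g in called) / (2 * len(called))


# Hypothetical data: a 2 bp insertion ("TA") relative to the reference allele.
REF, ALT, K = "AACCGGTT", "AACCTAGGTT", 4
samples = {"s1": [REF, REF], "s2": [REF, ALT], "s3": [ALT, ALT]}

index = build_index(samples, K)  # built once, reused for any new variant
calls = {s: genotype(index, s, REF, ALT, K) for s in samples}
print(calls, allele_frequency(calls))
```

The index is built a single time; genotyping an additional variant only requires k-mer lookups, which is the property that sidesteps the N+1 problem in this simplified setting. A real workflow would additionally need to handle sequencing errors, coverage variation, and repeated k-mers, none of which this sketch models.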

Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes

Integer programming framework for pangenome-based genome inference

Personalized pangenome references

Large-scale Genotyping of Complex DNA

Efficient inference of large pangenomes with PanTA

Minimizing Reference Bias with an Impute-First Approach

PanKmer: k-mer-based and reference-free pangenome analysis

Haplotype-aware pantranscriptome analyses using spliced pangenome graphs

The Great Genotyper: A Graph-Based Method for Population Genotyping of Small and Structural Variants

Building pangenome graphs

Exploring intra- and intergenomic variation in haplotype-resolved pangenomes

Efficient inference of large prokaryotic pangenomes with PanTA

The power and promise of population genomics: from genotyping to genome typing

k-mer-based approaches to bridging pangenomics and population genetics

Pangenomes aid accurate detection of large insertion and deletions from gene panel data: the case of cardiomyopathies

Genotyping sequence-resolved copy-number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes

Compressive Pangenomics Using Mutation-Annotated Networks

Powerful gene-based testing by integrating long-range chromatin interactions and knockoff genotypes

Pangenome graph construction from genome alignments with Minigraph-Cactus