Abstract: Recent breakthroughs have enabled the inference of genealogies from large sequencing datasets, accurately reconstructing local trees that describe genetic ancestry at each locus. These genealogies should also capture the correlation structure of local trees along the genome, reflecting historical recombination events and factors like demography and natural selection. However, whether reconstructed genealogies accurately capture this correlation structure has not been rigorously explored. This is important to address, since uncovering regions that deviate from expectations can reveal new biological phenomena, such as the suppression of recombination allowing linked selection over broad regions, evidenced in humans and in adaptive introgression events in various species. We use a theoretical framework to characterise properties of genealogies, such as the distribution of genomic spans of clades and edges, and demonstrate that our theoretical results match observations in various simulated scenarios. Testing genealogies reconstructed using leading approaches, we find departures from theoretical expectations for all methods. However, for the method Relate, a set of simple corrections results in almost complete recovery of the target distributions. Applying these corrections to genealogies reconstructed using Relate for 2504 human genomes, we observe an excess of clades with unexpectedly long genomic spans (125 with $p < 1 \cdot 10^{-12}$, clustering into 50 regions), indicating localised suppression of historical recombination. The strongest signal corresponds to a known inversion on chromosome 17, while the second strongest represents a previously unknown inversion on chromosome 10, which is most common (21%) in South Asians and correlates with GWAS hits for a range of phenotypes including immunological traits. Other signals suggest additional large inversions (4), copy number changes (2), and complex rearrangements or other variants (12), as well as 28 regions with strong support but no clear classification. Our approach can be readily applied to other species, and shows that genealogies offer previously untapped potential to study structural variation and its impacts at a population level, revealing new phenomena impacting evolution.

The length of haplotype blocks and signals of structural variation in reconstructed genealogies
