Estimation of demography and mutation rates from one million haploid genomes

Joshua G. Schraiber, Jeffrey P. Spence, Michael D. Edge
DOI: https://doi.org/10.1101/2024.09.18.613708
2024-09-22
Abstract: As genetic sequencing costs have plummeted, datasets with sizes previously unthinkable have begun to appear. Such datasets present new opportunities to learn about evolutionary history, particularly via rare alleles that record the very recent past. However, beyond the computational challenges inherent in the analysis of many large-scale datasets, large population-genetic datasets present theoretical problems. In particular, the majority of population-genetic tools require the assumption that each mutant allele in the sample is the result of a single mutation (the "infinite-sites" assumption), which is violated in large samples. Here, we present a method for estimating mutation rates and recent demographic history from very large samples. The method avoids the infinite-sites assumption by using a diffusion approximation to a branching-process model with recurrent mutation. The branching-process approach limits the method to rare alleles, but, along with recent results, renders tractable likelihoods with recurrent mutation. We show that the method performs well in simulations and apply it to rare-variant data from a million haploid samples, identifying a signal of mutation-rate heterogeneity within commonly analyzed classes and predicting that in modern sample sizes, most rare variants at sites with high mutation rates represent the descendants of multiple mutation events.
Genomics