Faster model-based estimation of ancestry proportions

Jonas Meisner, Cindy Santander, Alba Refoyo Martinez
DOI: https://doi.org/10.1101/2024.07.08.602454
2024-07-11
Abstract: Ancestry estimation from genotype data in unrelated individuals has become an essential tool in population and medical genetics to understand demographic population histories and to model or correct for population structure. The ADMIXTURE software is a widely used model-based approach to account for population stratification; however, it struggles with convergence issues and does not scale to modern human datasets or to the large number of variants in whole-genome sequencing data. Likelihood-free approaches optimize a least-squares objective and have gained popularity in recent years due to their scalability. However, this scalability comes at the cost of accuracy in the ancestry estimates. We present a new model-based approach, fastmixture, which adopts aspects from likelihood-free approaches for parameter initialization, followed by a mini-batch expectation-maximization procedure to model the standard likelihood. We demonstrate in a simulation study that the model-based approaches of fastmixture and ADMIXTURE are significantly more accurate than recent likelihood-free approaches. We further show that fastmixture runs approximately 20 times faster than ADMIXTURE on both simulated and empirical data from the 1000 Genomes Project, such that our model-based approach scales to much larger sample sizes than previously possible. Our software is freely available at https://github.com/Rosemeis/fastmixture.
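To make the "standard likelihood" concrete, the sketch below implements the textbook binomial admixture model and one full-data expectation-maximization (EM) update for it in NumPy. It is an illustration under stated assumptions, not the fastmixture implementation: the abstract does not specify how mini-batches are formed or how the likelihood-free initialization works, so this sketch uses random initialization and updates on all variants at once. The function names (`simulate`, `log_likelihood`, `em_step`, `run_em`) and the bound `eps` are hypothetical choices made for this example.

```python
import numpy as np


def simulate(n, m, k, seed=0):
    """Draw toy genotypes G (n x m, values 0/1/2) from the admixture model."""
    rng = np.random.default_rng(seed)
    Q = rng.dirichlet(np.ones(k), size=n)        # ancestry proportions, rows sum to 1
    P = rng.uniform(0.05, 0.95, size=(m, k))     # ancestral allele frequencies
    G = rng.binomial(2, Q @ P.T)                 # g_ij ~ Binomial(2, h_ij)
    return G.astype(float), Q, P


def log_likelihood(G, Q, P):
    """Binomial log-likelihood (up to a constant), with h_ij = sum_k q_ik * p_jk."""
    H = Q @ P.T
    return float(np.sum(G * np.log(H) + (2.0 - G) * np.log(1.0 - H)))


def em_step(G, Q, P, eps=1e-5):
    """One full-data EM update of Q and P for the standard admixture likelihood."""
    m = G.shape[1]
    H = Q @ P.T
    A = G / H                    # weights for minor-allele copies
    B = (2.0 - G) / (1.0 - H)    # weights for major-allele copies
    # Expected allele counts attributed to each ancestral population.
    num = P * (A.T @ Q)
    den = (1.0 - P) * (B.T @ Q)
    # Clip frequencies away from 0 and 1 so the logarithms stay finite.
    P_new = np.clip(num / (num + den), eps, 1.0 - eps)
    # Each row of Q_new sums to 1 by construction.
    Q_new = Q * (A @ P + B @ (1.0 - P)) / (2.0 * m)
    return Q_new, P_new


def run_em(G, k, iters=20, seed=1):
    """Run EM from a random start (an assumption; not the paper's initialization)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    Q = rng.dirichlet(np.ones(k), size=n)
    P = rng.uniform(0.1, 0.9, size=(m, k))
    lls = [log_likelihood(G, Q, P)]
    for _ in range(iters):
        Q, P = em_step(G, Q, P)
        lls.append(log_likelihood(G, Q, P))
    return Q, P, lls
```

Calling `run_em(G, k=3)` on genotypes from `simulate(50, 200, 3)` returns the estimated `Q` and `P` together with the log-likelihood trace. Full-data EM never decreases the likelihood, which makes the trace a convenient sanity check. A mini-batch variant would apply similar updates to subsets of variants and does not carry the same per-step guarantee.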
Bioinformatics