Abstract: Ancient genomic data is becoming increasingly available thanks to recent advances in high-throughput sequencing technologies. Yet post-mortem degradation of endogenous ancient DNA often results in low depth of coverage and, consequently, high levels of genotype missingness and uncertainty. Genotype imputation is a potential strategy for increasing the information available in ancient DNA samples and thus improving the power of downstream population genetic analyses. However, the performance of genotype imputation on ancient genomes under different conditions has not yet been fully explored, as previous work has relied primarily on an empirical approach of downsampling high-coverage paleogenomes. While these studies have provided invaluable insights into best practices for imputation, they rely on a fairly limited number of existing high-coverage samples with significant temporal and geographical biases. As an alternative, we used a coalescent simulation approach to generate genomes with characteristics of ancient DNA in order to more systematically evaluate the performance of two popular imputation tools, BEAGLE and GLIMPSE, under variable divergence times between the target sample and reference haplotypes, as well as different depths of coverage and reference panel sizes. Our results suggest that for genomes with coverage ≤0.1x, imputation performance is poor regardless of the strategy employed. Beyond 0.1x coverage, imputation generally improves as the size of the reference panel increases, and imputation accuracy decreases with increasing divergence between target and reference populations. It may thus be preferable to compile a smaller set of less diverged reference samples than a larger, more highly diverged dataset. In addition, imputation accuracy may plateau beyond some level of divergence between the reference and target populations. While accuracy at common variants is similar regardless of divergence time, rarer variants are better imputed in less diverged target samples. Furthermore, both imputation tools, but particularly GLIMPSE, overestimate high genotype probability calls, especially at low coverage. Our results provide insight into optimal strategies for ancient genotype imputation under a wide set of scenarios, complementing previous empirical studies based on imputing downsampled high-coverage ancient genomes.
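The observation that high genotype probability calls are overestimated can be illustrated with a simple calibration check. The following is a minimal sketch, not the procedure used in this study: it assumes each imputed site comes with a triple of genotype probabilities (for 0, 1 or 2 copies of the alternate allele) and that the true genotype is known, as it would be in a simulation. The function names `to_calls` and `calibration_by_bin` and the bin edges are hypothetical choices made for illustration.

```python
from statistics import mean


def to_calls(gps, truth):
    """Turn genotype probability triples into (max_gp, is_correct) pairs.

    gps   : list of (p0, p1, p2) triples, one per site (assumed input format)
    truth : list of true genotypes coded 0, 1 or 2
    """
    calls = []
    for gp, g in zip(gps, truth):
        best = max(range(3), key=lambda k: gp[k])  # most probable genotype
        calls.append((gp[best], best == g))
    return calls


def calibration_by_bin(calls, edges=(0.5, 0.8, 0.9, 0.99, 1.0)):
    """Compare reported confidence with observed accuracy in each bin.

    Returns rows of (lower, upper, n_sites, mean_max_gp, accuracy).
    A bin is overconfident when mean_max_gp exceeds accuracy.
    """
    lowers = (0.0,) + tuple(edges[:-1])
    rows = []
    for lo, hi in zip(lowers, edges):
        in_bin = [(p, ok) for p, ok in calls if lo < p <= hi]
        if not in_bin:
            continue  # skip empty bins rather than report undefined accuracy
        mean_gp = mean(p for p, _ in in_bin)
        accuracy = mean(1.0 if ok else 0.0 for _, ok in in_bin)
        rows.append((lo, hi, len(in_bin), mean_gp, accuracy))
    return rows


if __name__ == "__main__":
    # Made-up toy values, for illustration only (not results from the study).
    gps = [(0.02, 0.03, 0.95), (0.95, 0.04, 0.01), (0.3, 0.6, 0.1), (0.0005, 0.999, 0.0005)]
    truth = [2, 1, 1, 1]
    for lo, hi, n, mean_gp, acc in calibration_by_bin(to_calls(gps, truth)):
        print(f"({lo}, {hi}]  n={n}  mean GP={mean_gp:.3f}  accuracy={acc:.3f}")
```

Under these assumptions, a well-calibrated method would show mean genotype probability close to observed accuracy in every bin, whereas overestimation would appear as accuracy falling below the mean probability in the highest bins.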

Evaluation of ancient DNA imputation: a simulation study

The current landscape of clinical exome and genome reanalysis in the U.S.

Estimation of demography and mutation rates from one million haploid genomes

The performance of AlphaMissense to identify genes influencing disease

The performance of AlphaMissense to identify genes causing disease

Efficient storage and regression computation for population-scale genome sequencing studies

SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests

On the estimation of genome-average recombination rates

Evaluation of an Automated Genome Interpretation Model for Rare Disease Routinely Used in a Clinical Genetic Lab

Estimating the age of mutant disease alleles based on linkage disequilibrium.

Analysis of protein-coding genetic variation in 60,706 humans

Quantifying genetic regulatory variation in human populations improves transcriptome analysis in rare disease patients

Unveiling the hidden: revisiting the potential of old genetic data for rare disease research

Simultaneous estimation of genotype error and uncalled deletion rates in whole genome sequence data

Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage

The usefulness of whole-exome sequencing in routine clinical practice

Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes