Benchmarking Imputed Low Coverage Genomes in a Human Population Genetics Context

Gludhug Ariyo Purnomo, Joao Carlos Teixeira, Herawati Sudoyo, Bastien Llamas, Raymond Tobler
DOI: https://doi.org/10.1101/2024.06.02.597067
2024-06-03
Abstract: Ongoing advances in population genomic methodologies have recently made it possible to study millions of loci across hundreds of genomes at a relatively low cost, by leveraging a combination of low-coverage shotgun sequencing and innovative genotype imputation methods. This approach has the potential to provide economical access to genotype information that is similar to the most widely used low-cost genotyping approach, i.e. SNP panels, while avoiding potential issues related to loci being ascertained in distantly related populations. Nonetheless, adoption of imputation methods has been constrained by the lack of suitable reference panels of phased genomes, as performance degrades when panel individuals are distantly related to the target populations. Recent advances in imputation algorithms, however, now allow genetic information from the target population to be used in the imputation process, potentially mitigating the lack of a suitable reference panel. Here we assess the performance of the recently released GLIMPSE imputation software on a set of 250 low-coverage genomes (~3x) from populations from Island Southeast Asia and Near Oceania that are poorly represented in publicly available datasets, comparing the use of imputed genotypes against other common genotype calling methods for a range of standard population genomic analyses. We find that imputation performance and inference both greatly improved when genetic information from the 250 target individuals was leveraged, with comparable results to pseudo-haploid calls that trade off improved precision with reduced accuracy. Our study shows that imputed genotypes are a cost-effective and robust basis for population genomic studies of groups, especially those that are poorly represented in publicly available data.
Evolutionary Biology
What problem does this paper attempt to address?
This paper evaluates how well low-coverage whole-genome sequencing (lc-WGS) data support population genetic analyses, especially when genotypes are obtained by imputation. Specifically, the research aims to:

1. **Evaluate the performance of different genotype inference methods**: Compare four genotype inference methods (naive genotypes, imputed genotypes using 8 high-coverage samples, imputed genotypes using all 256 low-coverage samples, and pseudohaploid calls) in terms of their accuracy and missing rate on low-coverage data.
2. **Verify the application of low-coverage genomes in standard population genetics analysis**: Use principal component analysis (PCA), ancestry component estimation (ADMIXTURE) and f4 statistics to determine whether results from low-coverage data are comparable to those from high-coverage data.
3. **Explore the advantages of the GLIMPSE algorithm in low-coverage genome imputation**: In particular, test whether incorporating the genetic information of the target population into the imputation process improves imputation accuracy and performance.

### Research background

In recent years, with the progress of sequencing technology, low-coverage whole-genome sequencing (lc-WGS) combined with genotype imputation has become a cost-effective way to obtain genotype information in large-scale population studies. However, the effectiveness of this approach depends on an appropriate reference panel, and existing reference panels often represent certain populations poorly (such as those of Island Southeast Asia and Near Oceania). This study therefore tests whether introducing the genetic information of the target population improves genotype imputation, and verifies the use of the resulting genotypes in population genetics analysis.
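To make the difference between naive genotypes and pseudohaploid calls concrete, the following is a minimal Python sketch of both call types at a single biallelic site, given read counts for the reference and alternate alleles. The function names and the simple presence/absence rule for the naive call are assumptions made for illustration; the summary does not specify the exact callers used in the study. The helper `prob_het_detected` assumes error-free reads that sample each allele of a heterozygote with probability 0.5.

```python
import random


def naive_genotype(ref_reads, alt_reads):
    """Hypothetical naive diploid call from read counts.

    Returns the number of alternate alleles (0, 1 or 2),
    or None when no reads cover the site (a missing call).
    """
    if ref_reads + alt_reads == 0:
        return None
    if alt_reads == 0:
        return 0  # only reference reads seen: called homozygous reference
    if ref_reads == 0:
        return 2  # only alternate reads seen: called homozygous alternate
    return 1      # both alleles seen: called heterozygous


def pseudohaploid_call(ref_reads, alt_reads, rng):
    """Pick one read at random and report its allele (0 = ref, 1 = alt)."""
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return 1 if rng.randrange(total) < alt_reads else 0


def prob_het_detected(depth):
    """Chance that both alleles of a true heterozygote appear among
    `depth` reads, assuming error-free reads and equal allele sampling."""
    if depth < 2:
        return 0.0
    return 1.0 - 2.0 * 0.5 ** depth


if __name__ == "__main__":
    rng = random.Random(1)
    print(naive_genotype(2, 1), pseudohaploid_call(2, 1, rng))
    print(prob_het_detected(3))
```

Under these assumptions, a true heterozygote covered by exactly three reads shows both alleles only 75% of the time, which illustrates why direct calls from ~3x data are expected to struggle at heterozygous loci.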
### Method overview

- **Sample selection**: The study selected 256 individuals from 11 different populations in Island Southeast Asia and Near Oceania.
- **Sequencing and processing**: 8 samples were sequenced at high coverage (~30x), and the remaining samples were sequenced at low coverage (~3x). The data from all samples were pre-processed and aligned.
- **Genotype inference**: The GLIMPSE algorithm was used for genotype imputation in two configurations: using only the 8 high-coverage samples as references, and using all 256 low-coverage samples as references.
- **Performance evaluation**: The accuracy and missing rate of each genotype inference method were evaluated by comparison with the true genotypes of the high-coverage samples. In addition, PCA, ADMIXTURE and f4-statistics analyses were carried out to evaluate how each method performs on low-coverage data.

### Main findings

- **The Impute_all method performs the best**: In most cases, imputation using all 256 low-coverage samples (Impute_all) shows the highest accuracy and the lowest missing rate, especially at heterozygous loci.
- **The Pseudohaploid method is second**: The pseudohaploid method has relatively high accuracy, but shows reference bias at heterozygous loci.
- **The Naive method performs the worst**: Calling genotypes directly from low-coverage data (naive genotypes) gives the lowest accuracy and the highest missing rate, especially at heterozygous loci.

In general, this study shows that introducing the genetic information of the target population into genotype imputation can significantly improve the performance of low-coverage genomes in population genetics analysis, providing strong support for future large-scale population studies.
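The performance evaluation described in the method overview (accuracy and missing rate against high-coverage truth, with attention to heterozygous loci) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `evaluate_calls`, the 0/1/2 genotype encoding, the use of `None` for missing calls, and the toy data are all assumptions. Accuracy here is computed among called sites only, which is one of several reasonable definitions.

```python
from collections import defaultdict


def evaluate_calls(truth, calls):
    """Compare inferred genotypes with truth, stratified by true genotype.

    truth: list of alternate-allele counts (0, 1 or 2) from high-coverage data.
    calls: list of the same length with 0, 1, 2, or None for a missing call.
    Returns {true_genotype: {"accuracy": ..., "missing_rate": ...}}, where
    accuracy is the fraction of non-missing calls that match the truth.
    """
    if len(truth) != len(calls):
        raise ValueError("truth and calls must have the same length")

    stats = defaultdict(lambda: {"n": 0, "missing": 0, "correct": 0})
    for true_gt, call in zip(truth, calls):
        s = stats[true_gt]
        s["n"] += 1
        if call is None:
            s["missing"] += 1
        elif call == true_gt:
            s["correct"] += 1

    summary = {}
    for true_gt, s in stats.items():
        called = s["n"] - s["missing"]
        summary[true_gt] = {
            # None when every site of this class is missing
            "accuracy": s["correct"] / called if called else None,
            "missing_rate": s["missing"] / s["n"],
        }
    return summary


if __name__ == "__main__":
    # Made-up toy data, for illustration only
    truth = [0, 0, 1, 1, 1, 2]
    calls = [0, None, 1, 0, 1, 2]
    for gt, result in sorted(evaluate_calls(truth, calls).items()):
        print(gt, result)
```

Stratifying by the true genotype class keeps errors at heterozygous sites visible, since they would otherwise be diluted by the typically much larger number of homozygous sites.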