Abstract:

Background: Although high-throughput genotyping arrays have made whole-genome association studies (WGAS) feasible, only a small proportion of the SNPs in the human genome are actually surveyed in such studies. In addition, different SNP arrays assay different sets of SNPs, which makes it difficult to compare results and to merge data for meta-analyses. Genome-wide imputation of untyped markers addresses both issues directly.

Methods: 384 Caucasian American liver donors were genotyped using Illumina 650Y (Ilmn650Y) arrays, from which we also derived the genotypes corresponding to the Ilmn317K array. On these data, we compared two imputation methods, MACH and BEAGLE. We imputed 2.5 million HapMap Release 22 SNPs and conducted GWAS on ~40,000 liver mRNA expression traits (eQTL analysis). In addition, 200 Caucasian American and 200 African American subjects were genotyped using the Affymetrix 500K array plus a custom 164K fill-in chip. We then imputed the HapMap SNPs and quantified accuracy by randomly masking observed SNPs.

Results: MACH and BEAGLE perform similarly with respect to imputation accuracy. The Ilmn650Y set yields excellent imputation performance and outperforms the Affx500K and Ilmn317K sets. For Caucasian Americans, 90% of the HapMap SNPs were imputed at 98% accuracy. As expected, imputation of poorly tagged SNPs (untyped SNPs in weak LD with typed markers) was less successful. Genotypes were harder to impute in the African American population, given (1) shorter LD blocks and (2) admixture with Caucasian populations. To address issue (2), we pooled HapMap CEU and YRI data as an imputation reference set, which greatly improved overall performance. The approximately 40,000 phenotypes scored in these populations provide a way to determine empirically how the imputation procedures affect the power to detect associations. That is, at a fixed false discovery rate, the number of cis-eQTL discoveries detected by each method can be interpreted as its relative statistical power in the GWAS. In this study, we find that imputation offers only modest additional power (4%) on top of either Ilmn317K or Ilmn650Y, much less than the power gained by moving from Ilmn317K to Ilmn650Y (13%).

Conclusion: Current algorithms can accurately impute genotypes for untyped markers, which enables researchers to pool data between studies conducted using different SNP sets. While genotyping itself has a small error rate (e.g. 0.5%), imputing genotypes is surprisingly accurate. We found that dense marker sets (e.g. Ilmn650Y) outperform sparser ones (e.g. Ilmn317K) in terms of imputation yield and accuracy. We also found that genotypes were harder to impute for African American samples, partly because of population admixture, although using a pooled reference boosts performance. Interestingly, GWAS carried out using imputed genotypes showed only slightly more power than GWAS using the assayed SNPs alone. The likely reason is that adding more markers via imputation yields only a modest gain in genetic coverage while worsening the multiple-testing penalty. Furthermore, cis-eQTL mapping using the dense SNP set derived from imputation achieves high resolution and locates association peaks closer to causal variants than the conventional approach.
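The power comparison described above rests on counting discoveries at a fixed false discovery rate. The following is a minimal sketch of such a count, assuming the Benjamini-Hochberg step-up procedure; the abstract does not state which FDR method the authors used. The function name `bh_discoveries`, the 10% FDR level, and the toy p-values are all hypothetical and are not taken from the study.

```python
def bh_discoveries(p_values, fdr=0.10):
    """Count rejections under the Benjamini-Hochberg step-up procedure.

    Assumption: BH is used here only as one common way to fix the FDR;
    the source does not say which procedure the study applied.
    """
    m = len(p_values)
    discoveries = 0
    # Find the largest rank i whose sorted p-value is <= fdr * i / m;
    # all tests up to that rank are declared discoveries.
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= fdr * rank / m:
            discoveries = rank
    return discoveries


# Toy p-values (illustrative only, not study data).
assayed = [0.001, 0.004, 0.03, 0.2, 0.5, 0.9]
# Hypothetical imputed markers: one new signal plus many null tests.
imputed = [0.002, 0.15, 0.3, 0.35, 0.4, 0.45, 0.55,
           0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.95]

n_assayed = bh_discoveries(assayed)
n_combined = bh_discoveries(assayed + imputed)
print(n_assayed, n_combined)
```

In this toy case the combined set gains one new signal but loses a borderline one, because the larger number of tests tightens every threshold. This is the same trade-off between added coverage and the multiple-testing penalty that the conclusion describes.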

Needles in the Haystack: Identifying Individuals Present in Pooled Genomic Data

Large-scale Genotyping of Complex DNA

Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering

Family-Based Association Tests for Genomewide Association Scans

A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase

A Machine Learning Approach for Missing Persons Cases with High Genotyping Errors

Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

No major flaws in “Identification of individuals by trait prediction using whole-genome sequencing data”

MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes

Analyses And Comparison Of Accuracy Of Different Genotype Imputation Methods

Extending Rare-Variant Testing Strategies: Analysis of Noncoding Sequence and Imputed Genotypes

Robust Relationship Inference in Genome-Wide Association Studies.

Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies

Detect and Adjust for Population Stratification in Population-Based Association Study Using Genomic Control Markers: an Application of Affymetrix Genechip® Human Mapping 10K Array

Parallel Analysis of 124 Universal SNPs for Human Identification by Targeted Semiconductor Sequencing

Ethnic-Affiliation Estimation by Use of Population-Specific DNA Markers

The effect of single nucleotide polymorphism identification strategies on estimates of linkage disequilibrium.

Ultra-Fast Identity by Descent Detection in Biobank-Scale Cohorts Using Positional Burrows-Wheeler Transform.

Effective Selection of Informative SNPs and Classification on the HapMap Genotype Data.

Enhancing testing efficacy of high-density SNP microarrays to distinguish pedigrees belonging to the same kinship class

Using simulated microhaplotype genotyping data to evaluate the value of machine learning algorithms for inferring DNA mixture contributor numbers