Abstract: Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly, using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross-study analyses but also increases the power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or the smaller amounts of data typical of fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole-genome shotgun sequencing technologies.

What problem does this paper attempt to address?

The paper addresses how to estimate unobserved genotypes and haplotypes more effectively in genome-wide association studies (GWAS). Specifically, it explores how to use existing genotype and sequence data to estimate unobserved genotypes and haplotypes, and how to measure the quality of those estimates. It describes and evaluates MaCH, a method based on a Markov chain framework, which improves the accuracy of genotype estimation, thereby increasing the statistical power of genetic association studies and facilitating comparison of results and meta-analysis across studies.

### Main problems

1. **Improving the accuracy of genotype estimation**: Using the Markov chain framework, MaCH estimates unobserved genotypes and haplotypes accurately in large-scale genomic data.
2. **Enhancing the statistical power of genetic association studies**: More accurate genotype estimates increase the ability to detect variants related to complex diseases.
3. **Facilitating comparison of results and meta-analysis across studies**: A shared genotype estimation method makes it easier to compare and combine the results of different studies.

### Method overview

- **Markov chain framework**: MaCH describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes.
- **Quality assessment**: The paper evaluates MaCH on simulated and experimental data, using measures that include the number of incorrectly estimated genotypes, the number of flips needed to correct the estimated haplotypes, and the number of completely correct haplotypes.
- **Application example**: The paper applies MaCH to the Finland-United States Investigation of NIDDM Genetics (FUSION) GWAS to evaluate its performance on real data.
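The mosaic idea in the method overview can be illustrated with a small hidden Markov model. The sketch below is not MaCH's implementation: it assumes a single phased (haploid) target chromosome, a fixed per-interval switch probability `theta`, and a fixed allele mismatch probability `eps`, whereas the real software works with unphased genotypes and estimates its parameters from the data. The function name `impute_haploid` and all parameter names are hypothetical.

```python
def impute_haploid(ref, target, theta=0.01, eps=0.01):
    """Toy mosaic HMM (illustrative assumptions, not MaCH's actual code).

    ref:    list of reference haplotypes, each a list of 0/1 alleles
    target: list of 0/1 alleles, with None at untyped markers
    Returns, per marker, the posterior probability that the copied
    reference haplotype carries allele 1.
    """
    n, m = len(ref), len(target)

    def emit(h, j):
        # Untyped markers carry no information about the hidden state.
        if target[j] is None:
            return 1.0
        return 1.0 - eps if ref[h][j] == target[j] else eps

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Forward pass: probability of each template given markers 0..j.
    fwd = [normalize([emit(h, 0) / n for h in range(n)])]
    for j in range(1, m):
        prev = fwd[-1]
        total = sum(prev)
        row = [emit(h, j) * ((1 - theta) * prev[h] + theta * total / n)
               for h in range(n)]
        fwd.append(normalize(row))

    # Backward pass: probability of markers j+1..m-1 given each template.
    bwd = [[1.0] * n for _ in range(m)]
    for j in range(m - 2, -1, -1):
        nxt = [emit(h, j + 1) * bwd[j + 1][h] for h in range(n)]
        total = sum(nxt)
        bwd[j] = normalize([(1 - theta) * nxt[h] + theta * total / n
                            for h in range(n)])

    # Posterior over templates, collapsed to an allele probability.
    probs = []
    for j in range(m):
        post = normalize([fwd[j][h] * bwd[j][h] for h in range(n)])
        probs.append(sum(post[h] for h in range(n) if ref[h][j] == 1))
    return probs


# Usage: two reference haplotypes, target typed at markers 0, 2 and 4 only.
reference = [[0, 0, 0, 0, 0],
             [1, 1, 1, 1, 1]]
observed = [1, None, 1, None, 1]
allele_probs = impute_haploid(reference, observed)
```

Because the typed markers all match the second reference haplotype, the posterior at the untyped markers leans strongly toward allele 1. Values close to 0 or 1 indicate confident imputation, while values near 0.5 indicate that the panel gives little information at that marker.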
### Results

- **High accuracy**: MaCH performs better than or comparably to existing methods on multiple datasets. Imputation of common variants using HapMap haplotypes as a reference is very accurate, with either genome-wide SNP data or the smaller amounts of data typical of fine-mapping studies.
- **Increased statistical power**: Genotype imputation increases the power of genetic association studies to detect variants related to complex diseases.
- **Wide applicability**: MaCH is applicable in a variety of populations and with different genotyping panels.

### Conclusion

The paper demonstrates the strong performance of MaCH in genotype estimation and haplotype reconstruction, providing a useful tool for genome-wide association studies and helping to discover more genetic variants related to complex diseases.

MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes
