MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes

Yun Li, Cristen J. Willer, Jun Ding, Paul Scheet, Gonçalo R. Abecasis
DOI: https://doi.org/10.1002/gepi.20533
2010-11-05
Genetic Epidemiology
Abstract: Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
genetics & heredity, mathematical & computational biology
What problem does this paper attempt to address?
The paper addresses how to estimate unobserved genotypes and haplotypes more effectively in genome-wide association studies (GWAS). Specifically, it examines how existing genotype data and shotgun sequence data can be used to estimate unobserved genotypes and haplotypes, together with measures of the quality of these estimates. Its main contribution is an evaluation of MaCH, a previously implemented method based on a Markov chain framework. Accurate genotype estimation with MaCH increases the statistical power of genetic association studies and makes it easier to compare results and run meta-analyses across studies.

### Main problems

1. **Improving the accuracy of genotype estimation**: Using a Markov chain framework, MaCH estimates unobserved genotypes and haplotypes in large-scale genomic data.
2. **Enhancing the statistical power of genetic association studies**: More accurate genotype estimation increases the ability to detect variants related to complex disease.
3. **Facilitating the comparison of results and meta-analysis between different studies**: A unified genotype estimation method makes it easier to compare and integrate results from different studies.

### Method overview

- **Markov chain framework**: MaCH describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes.
- **Quality assessment**: The paper evaluates MaCH on simulated and experimental data, using measures that include the number of incorrectly estimated genotypes, the number of flips required, and the number of completely correct haplotypes.
- **Application example**: The paper applies MaCH to the Finland-United States Investigation of NIDDM Genetics (FUSION) GWAS to evaluate its performance on real data.
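
The mosaic idea in the method overview can be illustrated with a small hidden Markov model. The sketch below is a deliberately simplified, haploid illustration and not the MaCH implementation: it treats one already-phased haplotype as an imperfect copy of a few reference haplotypes and uses the forward-backward algorithm to estimate the allele at untyped markers. The function name `impute_haploid`, the fixed switch rate `theta`, and the fixed copying-error rate `eps` are all assumptions made for this example; the summary above does not specify how MaCH handles unphased genotypes or sets its parameters.

```python
import numpy as np


def impute_haploid(ref, obs, theta=0.01, eps=0.01):
    """Posterior probability of allele 1 at each marker for one haplotype.

    ref   : (n_ref, n_markers) array of 0/1 reference alleles
    obs   : (n_markers,) array of 0/1 alleles, -1 where unobserved
    theta : assumed probability of switching template between markers
    eps   : assumed probability that an observed allele differs from
            the allele on the copied template
    """
    ref = np.asarray(ref)
    obs = np.asarray(obs)
    n_ref, n_markers = ref.shape

    # Emission probabilities: 1 at untyped markers, otherwise
    # (1 - eps) if the template matches the observed allele, else eps.
    emit = np.ones((n_markers, n_ref))
    typed = obs >= 0
    match = ref[:, typed].T == obs[typed][:, None]
    emit[typed] = np.where(match, 1.0 - eps, eps)

    def step(vec):
        # Symmetric transition: keep the current template with
        # probability (1 - theta), otherwise jump uniformly at random.
        return (1.0 - theta) * vec + theta * vec.sum() / n_ref

    # Forward pass, rescaled at each marker to avoid underflow.
    fwd = np.empty((n_markers, n_ref))
    fwd[0] = emit[0] / n_ref
    fwd[0] /= fwd[0].sum()
    for m in range(1, n_markers):
        fwd[m] = step(fwd[m - 1]) * emit[m]
        fwd[m] /= fwd[m].sum()

    # Backward pass, rescaled in the same way.
    bwd = np.empty((n_markers, n_ref))
    bwd[-1] = 1.0
    for m in range(n_markers - 2, -1, -1):
        bwd[m] = step(bwd[m + 1] * emit[m + 1])
        bwd[m] /= bwd[m].sum()

    # Posterior over which template is being copied at each marker.
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)

    # Probability that the copied template carries allele 1.
    return (post * ref.T).sum(axis=1)


if __name__ == "__main__":
    reference = [[0, 0, 0, 0, 0],
                 [1, 1, 1, 1, 1],
                 [0, 1, 0, 1, 0]]
    observed = [1, 1, -1, 1, 1]  # the middle marker is untyped
    print(impute_haploid(reference, observed))
```

Because the output is a probability rather than a hard call, values near 0 or 1 indicate confident estimates and values near 0.5 indicate uncertain ones, which is one simple way to think about per-marker quality of imputed alleles.
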
### Results

- **High accuracy**: MaCH performs better than or is comparable to existing methods on multiple datasets, especially showing higher accuracy when estimating low-frequency variants.
- **Increased statistical power**: Through genotype estimation, the statistical power of genetic association studies can be significantly improved, especially in detecting variants related to complex disease.
- **Wide applicability**: The MaCH method is applicable to multiple populations and different genotyping platforms, increasing its flexibility and reliability in practical applications.

### Conclusion

The paper demonstrates the superior performance of the MaCH method in genotype estimation and haplotype reconstruction, providing a powerful tool for genome-wide association studies and helping to discover more genetic variants related to complex diseases.