Simpute: A Simple Genotype Imputation Method

Yen Jen Lin, Chun-Tien Chang, Chuan Yi Tang, Wen-Ping Hsieh
DOI: https://doi.org/10.1109/cisis.2012.63
2012-01-01
Abstract: High-throughput genotyping technology has made genome-wide association studies possible. Single nucleotide polymorphism (SNP) data derived from array-based technology usually contain missing values, even though call rates are generally high and concordance across genotype-calling schemes is good. Missing SNPs can bias the results of association analyses, so some studies remove loci with missing data. Imputation compensates for missing data by filling in the most probable values. It can increase the power of an association study without the extra cost of genotyping the missing SNPs. In this article, we propose a simple imputation method (Simpute) that takes advantage of the high SNP resolution of either the array platform or the massively parallel sequencing platform. It is based on the linkage disequilibrium (LD) structure of the chromosome, and only two nearby SNPs are needed to fill in a missing genotype. Simpute does not use any reference data. We tested the method by randomly masking genotype data from phase III of the International HapMap Project, with the evaluation carried out on chromosome 21. Simpute was compared with two other algorithms. At highly linked SNP loci, it performs approximately as well as BEAGLE, a general-purpose algorithm that integrates a large amount of information. Simpute outperforms the second algorithm, proposed by Jung et al., which, like Simpute, does not use any reference samples. The best feature of Simpute is its computational efficiency, with complexity of order, where n is the number of missing SNPs, w is the number of positions of the missing SNPs, and m is the number of people considered. Simpute provides a simple, accurate, and fast solution for whole-genome imputation.
We have demonstrated that when SNPs are densely distributed on the chromosome, with high linkage disequilibrium between adjacent loci, there is no need to adopt complicated algorithms. Simpute is suitable for routine screening of large-scale SNP genotyping data, especially when the sample size is large and efficiency is a major concern in the workflow.
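The abstract does not spell out Simpute's decision rule, so the following Python sketch only illustrates the general idea of reference-free imputation from two nearby SNPs. It is not the authors' algorithm. The genotype coding (0, 1, 2 as minor-allele counts, None for a missing call), the function name impute_from_flanks, the choice of the immediately adjacent SNPs as the two neighbours, and the majority-vote rule are all assumptions made for illustration.

```python
from collections import Counter


def impute_from_flanks(genotypes):
    """Fill missing genotypes using the two immediately adjacent SNPs.

    genotypes: list of rows, one per individual; each row lists genotypes
    coded 0, 1, 2, with None marking a missing call (assumed coding).
    Returns a new matrix; the input is left unchanged.
    """
    n_snps = len(genotypes[0])
    filled = [row[:] for row in genotypes]

    for j in range(1, n_snps - 1):  # SNPs at the ends lack two neighbours
        left, right = j - 1, j + 1

        # Among individuals fully typed at all three loci, tally which
        # genotype at j accompanies each (left, right) genotype pair.
        votes = {}
        for row in genotypes:
            if row[left] is None or row[j] is None or row[right] is None:
                continue
            votes.setdefault((row[left], row[right]), Counter())[row[j]] += 1

        # Fill each missing call with the most frequent genotype seen
        # alongside the same flanking pair (assumed rule).
        for i, row in enumerate(genotypes):
            if row[j] is not None:
                continue
            if row[left] is None or row[right] is None:
                continue
            counts = votes.get((row[left], row[right]))
            if counts:
                filled[i][j] = counts.most_common(1)[0][0]

    return filled


# Toy panel in perfect LD: the middle SNP always matches its neighbours.
panel = [
    [0, 0, 0],
    [0, 0, 0],
    [2, 2, 2],
    [2, 2, 2],
    [0, None, 0],
    [2, None, 2],
    [1, None, 1],  # no fully typed individual shares this flanking pair
]
result = impute_from_flanks(panel)
```

In this toy panel the two masked calls whose flanking pairs also occur in fully typed individuals are filled in, while the last row stays missing because no one else carries its flanking pair. When adjacent loci are strongly linked, even a rule this simple recovers most masked genotypes, which is the intuition behind the abstract's claim. With weak LD the vote becomes uninformative.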