Abstract: High-throughput genotyping technology has made genome-wide association studies possible. Single nucleotide polymorphism (SNP) data derived from array-based technology usually contain missing values, even though call rates are generally high and concordance across different genotype calling schemes is good. Missing SNPs can bias the results of association analyses, and hence loci with missing data are removed in some studies. Imputation compensates for the missing data by filling in the most probable values. It can increase the power of an association study without the extra cost of genotyping the missing SNPs. In this article, we propose a simple imputation method (Simpute) that takes advantage of the high SNP density available on either array platforms or massively parallel sequencing platforms. It is based on the linkage disequilibrium (LD) structure of the chromosome, and only two nearby SNPs are needed to fill in a missing genotype. Simpute does not use any reference data. We tested the method by randomly masking genotype data from phase III of the International HapMap Project, with the evaluation carried out on chromosome 21. Simpute was compared with two other algorithms. At highly linked SNP loci, it performs approximately as well as BEAGLE, a general-purpose algorithm that integrates a wide range of information. Simpute outperforms the second algorithm, proposed by Jung et al., which, like Simpute, does not use any reference samples. The best feature of Simpute is its computational efficiency with complexity of order, where n is the number of missing SNPs, w is the number of the positions of the missing SNPs and m is the number of people considered. Simpute provides a simple, accurate and fast solution for whole-genome imputation.
We have demonstrated that when SNPs are densely distributed along the chromosome and linkage disequilibrium between adjacent loci is high, there is no need to adopt complicated algorithms. Simpute is suitable for routine screening of large-scale SNP genotyping data, especially when the sample size is large and efficiency is a major concern in the workflow.
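The idea of filling a missing genotype from two nearby SNPs can be illustrated with a small sketch. This is not the published Simpute algorithm, whose details are not given in the abstract. It is a hypothetical stand-in that assumes genotypes are coded 0/1/2, that a missing call is coded -1, and that the "two nearby SNPs" are the immediate left and right neighbours. For each missing entry it picks the genotype seen most often among individuals who carry the same genotypes at the two flanking loci, and falls back to the most common genotype at that locus when no such individuals exist. The function name `impute_from_neighbors` and the constant `MISSING` are illustrative only.

```python
from collections import Counter

MISSING = -1  # assumed encoding for a missing genotype call


def impute_from_neighbors(genotypes):
    """Fill missing genotypes using the two flanking SNPs as context.

    genotypes: list of rows (one per individual), each a list of
    genotype codes 0/1/2, with MISSING marking an absent call.
    Returns a new matrix; the input is left unmodified.
    """
    n_snps = len(genotypes[0]) if genotypes else 0
    result = [row[:] for row in genotypes]

    for j in range(n_snps):
        # Flanking loci; SNPs at the ends have only one neighbour.
        flanks = [k for k in (j - 1, j + 1) if 0 <= k < n_snps]

        # Tally observed genotypes at locus j, overall and per context.
        overall = Counter()
        by_context = {}
        for row in genotypes:
            if row[j] == MISSING:
                continue
            overall[row[j]] += 1
            context = tuple(row[k] for k in flanks)
            if MISSING in context:
                continue  # incomplete context carries no usable signal
            by_context.setdefault(context, Counter())[row[j]] += 1

        # Fill each missing call from the matching context if available.
        for i, row in enumerate(genotypes):
            if row[j] != MISSING:
                continue
            context = tuple(row[k] for k in flanks)
            counts = by_context.get(context) or overall
            if counts:
                result[i][j] = counts.most_common(1)[0][0]

    return result


example = [
    [0, 0, 0],
    [0, 0, 0],
    [2, 2, 2],
    [2, 2, 2],
    [2, MISSING, 2],
    [0, MISSING, 0],
]
imputed = impute_from_neighbors(example)
```

In this toy example the two flanking SNPs fully determine the middle one, which mimics the high-LD setting the abstract describes. A masking experiment like the one reported would hide known genotypes, run the imputation, and compare the filled values with the hidden ones.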
