Abstract:
Background: Genome-wide association studies have revealed that rare variants are responsible for a large portion of the heritability of some complex human diseases. This highlights the increasing importance of detecting and screening for rare variants. Although massively parallel sequencing technologies have greatly reduced the cost of DNA sequencing, identifying rare variant carriers by large-scale re-sequencing remains prohibitively expensive because of the huge challenge of constructing libraries for thousands of samples. Recently, several studies have reported that techniques from group testing theory and compressed sensing could help identify rare variant carriers in large-scale samples with few pooled sequencing experiments and at a dramatically reduced cost.

Results: Based on quantitative group testing, we propose an overlapping pool sequencing strategy that allows the efficient recovery of variant carriers among numerous individuals at a much lower cost than conventional methods. We used random k-set pool designs to mix samples, and optimized the design parameters according to an indicative probability. Based on a mathematical model of the sequencing depth distribution, an optimal threshold was selected to declare a pool positive or negative. Then, using the quantitative information contained in the sequencing results, we designed a heuristic Bayesian probability decoding algorithm to identify variant carriers. Finally, we conducted in silico experiments to find variant carriers among 200 simulated Escherichia coli strains. With the simulated pools and publicly available Illumina sequencing data, our method correctly identified the variant carriers for 91.5-97.9% of variants, with variant frequencies ranging from 0.5% to 1.5%.

Conclusions: Using the number of reads, variant carriers could be identified precisely even though samples were randomly selected and pooled. Our method performed better than the published DNA Sudoku design and compressed sequencing, especially in reducing the required data throughput and cost.
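To make the pooling idea concrete, the following is a minimal sketch of a random k-set pool design with a noise-free quantitative readout. It is not the heuristic Bayesian decoder described above: the elimination step is a simple stand-in that only discards samples appearing in a pool with no carriers. The function names, the pool count (30), the value k = 4, and the carrier indices are illustrative assumptions; only the sample count of 200 comes from the abstract.

```python
import random


def random_k_set_design(n_samples, n_pools, k, seed=0):
    """Assign each sample to k distinct pools chosen uniformly at random."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_pools), k)) for _ in range(n_samples)]


def pool_carrier_counts(design, carriers, n_pools):
    """Idealized quantitative readout: number of carriers mixed into each pool.

    Assumption: no sequencing noise. Real data would need a depth-based
    threshold to call a pool positive or negative.
    """
    counts = [0] * n_pools
    for sample in carriers:
        for pool in design[sample]:
            counts[pool] += 1
    return counts


def candidate_carriers(design, counts):
    """Keep only samples whose pools all contain at least one carrier.

    This is a plain elimination step, not the Bayesian decoding algorithm.
    """
    return {
        sample
        for sample, pools in enumerate(design)
        if all(counts[pool] > 0 for pool in pools)
    }


# Hypothetical setup: 200 samples, 30 pools, each sample in 4 pools.
design = random_k_set_design(n_samples=200, n_pools=30, k=4, seed=1)
carriers = {17, 142}  # illustrative carrier indices (1% variant frequency)
counts = pool_carrier_counts(design, carriers, n_pools=30)
candidates = candidate_carriers(design, counts)
```

In this noise-free setting every true carrier is guaranteed to remain in `candidates`, while non-carriers may survive as false positives; resolving those is where the quantitative read counts and a probabilistic decoder would come in.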

Quantitative Group Testing-Based Overlapping Pool Sequencing to Identify Rare Variant Carriers
