Abstract: Genotype data include errors that may influence conclusions reached by downstream statistical analyses. Previous studies have estimated genotype error rates from discrepancies in human pedigree data, such as Mendelian inconsistent genotypes or apparent phase violations. However, uncalled deletions, which generally have not been accounted for in these studies, can lead to biased error rate estimates. In this study, we propose a genotype error model that considers both genotype errors and uncalled deletions when calculating the likelihood of the observed genotypes in parent-offspring trios. Using simulations, we show that when there are uncalled deletions, our model produces genotype error rate estimates that are less biased than estimates from a model that does not account for these deletions. We apply our model to SNVs in 77 sequenced White British parent-offspring trios in the UK Biobank. We use the Akaike information criterion to show that our model fits the data better than a model that does not account for uncalled deletions. We estimate the genotype error rate at SNVs with minor allele frequency > 0.001 in these data to be . We estimate that 77% of the genotype errors at these markers are attributable to uncalled deletions.

A genotype error occurs when the genotype identified through molecular analysis does not match the actual genotype of the individual being analyzed. Because genotype errors can influence downstream statistical results, previous studies have attempted to estimate the rate of genotype errors in a study sample. However, uncalled deletions, which generally have not been accounted for in these studies, can lead to biased error rate estimates. In this study, we formulate a model that adjusts for uncalled deletions when estimating genotype error rates. We show that when uncalled deletions are present, this model yields less biased estimates of genotype error rates than a model that does not adjust for uncalled deletions. We apply this model to SNVs in 77 sequenced White British parent-offspring trios in the UK Biobank and estimate the genotype error rate and the proportion of genotype errors that are attributable to uncalled deletions at SNVs with minor allele frequency > 0.001.
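
To illustrate why uncalled deletions can bias error rate estimates that rely on Mendelian inconsistencies, the following minimal Python sketch checks a single trio for Mendelian consistency before and after genotype calling. It is not the model proposed in this study. It assumes, for illustration only, that a caller unaware of deletions reports a hemizygous individual as homozygous for the remaining allele. The `DEL` symbol and the function names `called_genotype` and `is_mendelian_consistent` are hypothetical.

```python
from itertools import product

DEL = "-"  # hypothetical symbol for a deleted allele on one haplotype


def called_genotype(true_genotype):
    """Genotype reported by a caller that is unaware of deletions.

    Assumption for this sketch: a hemizygous individual (one copy deleted)
    is reported as homozygous for the allele that is still present.
    """
    present = [allele for allele in true_genotype if allele != DEL]
    if len(present) == 2:
        return tuple(sorted(present))
    if len(present) == 1:
        return (present[0], present[0])
    return None  # both copies deleted: no genotype call


def is_mendelian_consistent(father, mother, child):
    """True if the child's unordered genotype can be formed by taking
    one allele from the father and one allele from the mother."""
    child = tuple(sorted(child))
    return any(
        tuple(sorted((f, m))) == child for f, m in product(father, mother)
    )


# True genotypes: the father carries a deletion and transmits it to the child.
father_true = ("A", DEL)
mother_true = ("A", "B")
child_true = (DEL, "B")

true_consistent = is_mendelian_consistent(father_true, mother_true, child_true)

# Genotypes as they would be reported when the deletion is not called.
called = [called_genotype(g) for g in (father_true, mother_true, child_true)]
called_consistent = is_mendelian_consistent(*called)

print("true genotypes consistent:", true_consistent)
print("called genotypes:", called)
print("called genotypes consistent:", called_consistent)
```

In this example the true genotypes are consistent with Mendelian inheritance, but the called genotypes (father A/A, mother A/B, child B/B) are not. A method that attributes every Mendelian inconsistency to genotype error would count this trio as evidence of an error even though each allele was read correctly.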
