Abstract: Liquid biopsy holds great promise for the noninvasive diagnosis of cancers by detecting minute amounts of cell-free DNA released from cancer cells into non-solid biological tissue such as peripheral blood. A critical bottleneck in developing liquid biopsy methods is the limited accuracy of current next-generation sequencing (NGS) technology, evidenced by its high error rate (0.1%-1%, as of 2018). Through mathematical modeling of NGS errors, we recently published a method that computationally suppresses the NGS error rate to between 10^-5 and 10^-4, two orders of magnitude lower than generally reported. However, this error rate is a product of both PCR errors and instrument (i.e., sequencer) errors, and it is currently unknown how to separate these error sources. In this work, we developed a novel computational algorithm to precisely measure the errors caused by sequencers. Using 12 publicly available datasets from 10 sequencing centers (in America, Europe, and Asia), we discovered highly reproducible patterns of sequencer errors, including: 1) the overall sequencer error rate is 10^-5; 2) at the flow-cell level, error rates are elevated on the bottom surface; 3) almost all flow cells have a small fraction of random tiles with a dramatically elevated error rate; 4) the elevated error rates appear to be enriched in some reaction cycles; 5) removal of these reaction cycles yields 5-fold lower error rates at some genomic loci, so that the A>C, A>T, and C>G error types have error rates close to 10^-6; and 6) sequencer errors have a pattern markedly distinct from that of PCR errors. We have incorporated these observations into a general-purpose algorithm, termed CleanDeepSeq2, that computationally suppresses sequencer errors and also effectively monitors sequencer anomalies.
CleanDeepSeq2 was engineered for efficiency, so that a dataset with ultra-deep sequencing (1,000,000X depth) can be processed in 1.5N minutes on a single CPU core, where N is the number of target regions. Similarly, WES (100X) and WGS (~30X) datasets can be processed in under 1 CPU hour in order to monitor instrument performance. Overall, we have developed a computational method that for the first time enables precise measurement of sequencer errors. Our study revealed novel insights into sequencer errors that can lead to improved instrumentation, improved NGS chemistry, and ultimately higher DNA sequencing fidelity. In addition, our software can efficiently suppress sequencer errors as well as errors from previously discovered sources.

Citation Format: Eric Davis, Rain Sun, Ying Shao, Yanling Liu, Heather L. Mulder, Stephen V. Rice, John Easton, Jinghui Zhang, Xiaotu Ma. Uncovering instrument errors in next-generation sequencing by CleanDeepSeq2 [abstract]. In: Proceedings of the AACR Special Conference on Advances in Liquid Biopsies; Jan 13-16, 2020; Miami, FL. Philadelphia (PA): AACR; Clin Cancer Res 2020;26(11_Suppl):Abstract nr A57.
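
The abstract reports that almost all flow cells contain a small fraction of tiles with a dramatically elevated error rate. As a rough illustration of how such tiles might be flagged during instrument monitoring, the sketch below computes a per-tile error rate and marks tiles that exceed a multiple of the flow-cell median. This is not the CleanDeepSeq2 algorithm, which the abstract does not describe. The function name `flag_elevated_tiles`, the input layout (tile id mapped to mismatch and base counts), and the 10-fold threshold are all assumptions made for illustration.

```python
from statistics import median


def flag_elevated_tiles(tile_counts, fold=10.0):
    """Flag tiles whose error rate exceeds `fold` times the median tile rate.

    tile_counts: dict mapping a tile id to (mismatches, bases_sequenced).
    Both the input layout and the default threshold are hypothetical.
    Returns (rates, flagged): per-tile error rates and a sorted list of
    flagged tile ids.
    """
    # Per-tile error rate; tiles with no sequenced bases are skipped.
    rates = {
        tile: mismatches / bases
        for tile, (mismatches, bases) in tile_counts.items()
        if bases > 0
    }
    if not rates:
        return rates, []

    # The median is used as the baseline so that a few bad tiles
    # do not inflate the reference rate.
    baseline = median(rates.values())
    flagged = sorted(t for t, r in rates.items() if r > fold * baseline)
    return rates, flagged


# Toy counts (invented for illustration, not taken from the study).
example_counts = {
    1101: (10, 1_000_000),
    1102: (12, 1_000_000),
    2101: (11, 1_000_000),
    2102: (900, 1_000_000),
}
example_rates, example_flagged = flag_elevated_tiles(example_counts)
print(example_flagged)
```

In practice the mismatch counts would have to come from a step that separates sequencer errors from PCR errors and true variants, which is the contribution the abstract claims and which this sketch does not attempt.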
