Abstract: Oxford Nanopore Technologies’ (ONT) long-read sequencers offer access to longer DNA fragments than previous sequencer generations, at the cost of a higher error rate. While many papers have studied read correction methods, few have addressed the detailed characterization of observed errors, a task complicated by frequent changes in chemistry and software in ONT technology. The MinION sequencer is now more stable, and this paper proposes an up-to-date view of its error landscape, using the most mature flowcell and basecaller. We studied Nanopore sequencing error biases on both bacterial and human DNA reads. We found that, although Nanopore sequencing is expected not to suffer from GC bias, GC content is a crucial parameter with respect to errors. In particular, low-GC reads have fewer errors than high-GC reads (about 6% and 8% respectively). The error profile for homopolymeric regions or regions with short repeats, the source of about half of all sequencing errors, also depends on the GC content and mainly shows deletions, although there are some reads with long insertions. Another interesting finding is that the quality measure, although over-estimated, offers valuable information to predict the error rate as well as the abundance of reads. We supplemented this study with an analysis of a rapeseed RNA read set and showed a higher level of errors, with a higher level of deletions, in these data. Finally, we have implemented an open-source pipeline for long-term monitoring of the error profile, which enables users to easily compute the various analyses presented in this work, including for future developments of the sequencing device. Overall, we hope this work will provide a basis for the design of better error-correction methods.

Error-Correcting Codes for Nanopore Sequencing

Correcting a Single Deletion in Reads from a Nanopore Sequencer

An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads

On the Asymptotic Rate of Optimal Codes that Correct Tandem Duplications for Nanopore Sequencing

Unrestricted Error-Type Codebook Generation for Error Correction Code in DNA Storage Inspired by NLP

Capacity-Approaching Constrained Codes with Error Correction for DNA-Based Data Storage

Concatenated Code Design for Constrained DNA Data Storage with Asymmetric Errors

Concatenated Nanopore DNA Codes

On Codes for the Noisy Substring Channel

Sequencing DNA with nanopores: Troubles and biases

On Coding for an Abstracted Nanopore Channel for DNA Storage

Sequence-Subset Distance and Coding for Error Control in DNA-based Data Storage

Models and Information-Theoretic Bounds for Nanopore Sequencing

Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers

Exact Error Exponents of Concatenated Codes for DNA Storage

An Error Correction Method of Nanopore Sequencing Data Using Deep Learning

Error-Correcting Codes for Combinatorial Composite DNA

The Capacity of the Weighted Read Channel

A Segmented-Edit Error-Correcting Code with Re-Synchronization Function for DNA-Based Storage Systems

Error-correcting Codes for Short Tandem Duplication and Substitution Errors

Sequencing coverage analysis for combinatorial DNA-based storage systems