Abstract: Nanopore sequencing is regarded as one of the most promising third-generation sequencing (TGS) technologies. Since 2014, Oxford Nanopore Technologies (ONT) has developed a series of nanopore-based devices that produce very long reads, with an expected impact on genomics. However, nanopore reads have a fairly high error rate because it is difficult to identify DNA bases from the complex electrical signals. Although several basecalling tools have been developed for nanopore sequencing in recent years, correcting the sequences after basecalling remains challenging. In this study, we developed an open-source DNA basecalling reviser, NanoReviser, based on a deep learning algorithm, to correct the basecalling errors introduced by the default basecallers. In our module, we re-segmented the raw electrical signals based on the basecalled sequences provided by the default basecallers. By employing convolutional neural networks (CNNs) and bidirectional long short-term memory (Bi-LSTM) networks, we took advantage of the information from both the raw electrical signals and the basecalled sequences. Our results showed that NanoReviser, as a post-basecalling reviser, significantly improves the basecalling quality. After being trained on standard ONT sequencing reads from public E. coli and human NA12878 datasets, NanoReviser reduced the sequencing error rate by over 5% for both the E. coli dataset and the human dataset. The performance of NanoReviser was found to be better than that of all current basecalling tools. Furthermore, we analyzed the modified bases of the E. coli dataset and added the methylation information to train our module. With the methylation annotation, NanoReviser reduced the error rate by 7% for the E. coli dataset and, specifically, by over 10% for the regions of the sequence rich in methylated bases.
To the best of our knowledge, NanoReviser is the first post-basecalling tool that accurately corrects nanopore sequences without the time-consuming step of building a consensus sequence. The NanoReviser package is freely available at https://github.com/pkubioinformatics/NanoReviser.
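The error-rate reductions quoted above depend on how the error rate of a read is measured, and the abstract does not state the exact metric. As a minimal sketch, assuming the error rate is taken as the edit distance between a read and its reference divided by the reference length, the comparison between a basecalled read and its revised version could look like this. The function names and the toy sequences are hypothetical and are not taken from the NanoReviser package.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def error_rate(read: str, reference: str) -> float:
    """Assumed metric: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(read, reference) / len(reference)


# Toy data (hypothetical): one substitution and one deletion before revision,
# only the deletion left after revision.
reference = "ACGTACGTAC"
basecalled = "ACGTTCGAC"
revised = "ACGTACGAC"

print(f"basecalled error rate: {error_rate(basecalled, reference):.2f}")
print(f"revised error rate:    {error_rate(revised, reference):.2f}")
```

In practice, reads are usually aligned to the reference with a dedicated aligner first, and the error rate is derived from that alignment; the quadratic-time dynamic programme here is only meant to make the quantity concrete on short strings.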

NanoReviser: an Error-correction Tool for Nanopore Sequencing Based on a Deep Learning Algorithm
