Abstract: The latest sequencing technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, can generate long reads thousands of bases in length, much longer than the reads of a few hundred bases generated by Illumina machines. However, these long reads have much higher error rates (for example, 15 percent), which makes downstream analysis and applications very difficult. Error correction is a process for improving the quality of sequencing data. Hybrid correction strategies have recently been proposed that use Illumina reads, which have low error rates, to fix sequencing errors in the noisy long reads, with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search for paths between pairs of solid k-mers iteratively, with an increasing k-mer length. At the second level, we combine the results produced by the first level under different parameters. In particular, a multiple sequence alignment algorithm is used to align the similar long reads, followed by a voting algorithm that determines the final base at each position of the reads. We compare Bicolor with three state-of-the-art methods on three real data sets. The results show that Bicolor achieves the highest identity ratio on all three data sets. On two of the data sets, Bicolor also achieves a higher alignment ratio (by more than 1.3%) and a higher number of aligned reads than the current methods. On the third data set, our method is closely competitive with the current methods in terms of the number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
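The second-level voting step can be illustrated with a minimal sketch. It is not taken from the Bicolor source code: it assumes the candidate corrections of one long read have already been aligned into equal-length rows padded with `-` gaps, that the vote is a simple per-column plurality, and that ties go to the earliest row. The function name `vote_consensus` and the example rows are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol in the aligned rows


def vote_consensus(aligned_rows):
    """Return a consensus string from equal-length, gap-padded rows.

    Illustrative plurality vote only; the voting rule actually used by
    Bicolor is not specified in the abstract.
    """
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*aligned_rows):
        counts = Counter(column)
        best = max(counts.values())
        # Assumed tie-break: the earliest row holding a top-scoring symbol wins.
        winner = next(sym for sym in column if counts[sym] == best)
        # A winning gap means the column is dropped from the consensus.
        if winner != GAP:
            consensus.append(winner)
    return "".join(consensus)


# Three hypothetical corrections of the same read, already aligned.
rows = [
    "ACGT-ACGA",
    "ACGTTACGA",
    "ACCT-ACGA",
]
print(vote_consensus(rows))
```

In this toy input the third column is decided two to one in favor of `G`, and the fifth column is dropped because two of the three rows carry a gap there.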
