Abstract:

Motivation: Illumina sequencing data can provide high coverage of a genome with relatively short (most often 100 bp to 150 bp) reads at low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data has, on average, an error in some read at every base in the genome. These errors complicate handling of the data because they produce a large number of low-count erroneous k-mers in the reads. However, the reads contain enough information to correct most of the sequencing errors, which makes subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term "error correction" to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software package called QuorUM. QuorUM is mainly aimed at error-correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads while preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.

Results: We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most of the metrics we use. QuorUM is implemented efficiently, makes use of current multi-core computing architectures, and is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. For the data sets investigated, QuorUM error-corrected reads improve N50 contig size by a factor of 1.1 to 4 compared with using the original reads with SOAPdenovo.

Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.

Contact: gmarcais@umd.edu.
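
The claim that sequencing errors produce many low-count erroneous k-mers can be illustrated with a small sketch. The code below is not QuorUM's algorithm. It is a toy example, assuming a made-up 20 bp sequence, k = 5, and a hypothetical count threshold of 2, that shows how a single substituted base creates up to k distinct k-mers that each occur only once, while true k-mers recur across overlapping reads. It also reproduces the coverage arithmetic from the abstract: at 100× coverage and a 1% per-base error rate, about one erroneous read base is expected at each genome position.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy data (assumption): one short "genome" sequenced five times without error,
# plus one read carrying a single substitution at position 10 (G -> T).
genome = "ACGTTGCAAGGCTTACCGAT"
bad_read = genome[:10] + "T" + genome[11:]
reads = [genome] * 5 + [bad_read]

k = 5                # illustrative k-mer length, not a QuorUM default
low_threshold = 2    # hypothetical cutoff for "low-count"

counts = kmer_counts(reads, k)
true_kmers = set(kmer_counts([genome], k))

# k-mers seen fewer than low_threshold times are candidates for being erroneous.
low_count = {kmer for kmer, n in counts.items() if n < low_threshold}

# Expected number of erroneous read bases covering one genome position.
coverage = 100
error_rate = 0.01
expected_errors_per_base = coverage * error_rate

print(f"low-count k-mers: {sorted(low_count)}")
print(f"expected errors per genome base: {expected_errors_per_base:.2f}")
```

In this toy case every low-count k-mer overlaps the substituted base and none of them occurs in the error-free sequence, which is why reducing the number of distinct erroneous k-mers is a natural target for an error corrector.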

QuorUM: An Error Corrector for Illumina Reads
