ReadsClean: a new approach to error correction of sequencing reads based on alignments clustering

Oleg Fokin, Anastasia Bakulina, Igor Seledtsov, Victor Solovyev
DOI: https://doi.org/10.48550/arXiv.1907.12718
2019-07-30
Abstract: Motivation: Next-generation DNA sequencing methods produce a relatively high rate of read errors, which interfere with de novo genome assembly of newly sequenced organisms and particularly affect the quality of SNP detection, which is important for the diagnosis of many hereditary diseases. A number of programs have been developed for correcting errors in NGS reads. These programs use various approaches and are optimized for different specific tasks, but all of them are far from able to correct all errors, especially in sequencing reads that cross repeats and in DNA from di/polyploid eukaryotic genomes. Results: This paper describes a novel method of error correction based on clustering alignments of similar reads. The method is implemented in the ReadsClean program, which is designed for cleaning Illumina HiSeq sequencing reads. We compared ReadsClean with other read-cleaning programs that several publications have recognized as the best. Our sequence assembly tests using actual and simulated sequencing reads show superior results for ReadsClean. Availability and implementation: ReadsClean is implemented as standalone C code. It is incorporated in an error correction pipeline and is freely available to academic users at the Softberry web server, http://www.softberry.com.
Genomics