Abstract: The important task of correcting label noise is addressed infrequently in the literature. The difficulty of developing a robust label correction algorithm leads to this silence concerning label correction. To break the silence, we propose two algorithms to correct label noise. One, called Self-Training Correction (STC), uses self-training to re-label noisy instances. The other, called Cluster-based Correction (CC), is a clustering-based method that groups instances together to infer their ground-truth labels. We also adapt an algorithm from previous work, a consensus-based method called Polishing that consults an ensemble of classifiers to change the values of attributes and labels. We simplify Polishing so that it alters only the labels of instances, and call the result Polishing Labels (PL). We experimentally compare our novel methods with Polishing Labels by examining their improvements in label quality, model quality, and AUC on binary and multi-class data sets under different noise levels. Our experimental results demonstrate that CC consistently and significantly improves label quality, model quality, and AUC. We further investigate how these three noise correction algorithms improve data quality, in terms of label accuracy, in the context of image labeling in crowdsourcing. First, we look at three consensus methods for inferring a ground-truth label from the multiple noisy labels obtained from crowdsourcing, i.e., Majority Voting (MV), Dawid-Skene (DS), and KOS. We then apply the three noise correction methods to correct the labels inferred by these consensus methods. Our experimental results show that the noise correction methods improve the labeling quality significantly. As an overall result of our experiments, we conclude that CC performs the best. Our research illustrates the viability of implementing noise correction as another line of defense against labeling error, especially in a crowdsourcing setting. Furthermore, it demonstrates the feasibility of automating the otherwise manual process of analyzing a data set and correcting and cleaning its instances, an expensive and time-consuming task. (C) 2016 Elsevier Ltd. All rights reserved.
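As a rough illustration of the two-stage idea described in the abstract (first infer one label per instance from several noisy crowd labels, then correct the inferred labels), the following Python sketch may help. It is not the authors' implementation: `majority_vote` is a plain Majority Voting rule, and `relabel_by_cluster` is a hypothetical, heavily simplified stand-in for a clustering-based correction that assumes cluster assignments have already been computed by some clustering step. The function names, the tie-breaking rule, and the toy data are all assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels).most_common(1)[0][0]


def relabel_by_cluster(
    labels: Sequence[Hashable], cluster_ids: Sequence[Hashable]
) -> List[Hashable]:
    """Replace each label with the majority label of its cluster.

    Simplifying assumption: cluster_ids come from an earlier clustering
    step that is not shown here; the method in the paper may differ.
    """
    if len(labels) != len(cluster_ids):
        raise ValueError("labels and cluster_ids must have the same length")
    members = {}
    for label, cid in zip(labels, cluster_ids):
        members.setdefault(cid, []).append(label)
    winners = {cid: majority_vote(group) for cid, group in members.items()}
    return [winners[cid] for cid in cluster_ids]


# Toy data: three instances, each labeled by three crowd workers.
crowd_labels = [["cat", "cat", "dog"], ["dog", "dog", "dog"], ["cat", "dog", "dog"]]
inferred = [majority_vote(row) for row in crowd_labels]

# Toy data: six inferred labels and their (assumed) cluster assignments.
noisy = ["cat", "cat", "dog", "dog", "dog", "cat"]
clusters = [0, 0, 0, 1, 1, 1]
corrected = relabel_by_cluster(noisy, clusters)
```

In this toy run the minority label inside each cluster is overwritten by the cluster's majority label, which is the intuition behind grouping instances to infer their ground-truth labels. A realistic version would need a real clustering step and some guard against clusters that mix several true classes.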

A Crowdsourcing Method For Correcting Sequencing Errors For The Third-Generation Sequencing Data

Integration of Hybrid and Self-Correction Method Improves the Quality of Long-Read Sequencing Data

An Approach to Correcting DNA Sequencing Error

MapReduce for Accurate Error Correction of Next-Generation Sequencing Data

HALC: High throughput algorithm for long read error correction

NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning

Turn ‘noise’ to signal: accurately rectify millions of erroneous short reads through graph learning on edit distances

Coupled Confusion Correction: Learning from Crowds with Sparse Annotations

Instance-based Error Correction for Short Reads of Disease-Associated Genes

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

Hybrid-hybrid correction of errors in long reads with HERO

ReadsClean: a new approach to error correction of sequencing reads based on alignments clustering

Three-way Decision-Based Noise Correction for Crowdsourcing

Comprehensive assessment of error correction methods for high-throughput sequencing data

Bi-Level Error Correction for PacBio Long Reads

Fec: a Fast Error Correction Method Based on Two-Rounds Overlapping and Caching

A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-enabled Graphics Hardware

Quality-Score Guided Error Correction for Short-Read Sequencing Data Using Cuda

Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA

Label Noise Correction and Application in Crowdsourcing

How Error Correction Affects PCR Deduplication: A Survey Based on UMI Datasets of Short Reads