Abstract: The important task of correcting label noise is rarely addressed in the literature, largely because a robust label correction algorithm is difficult to develop. To fill this gap, we propose two algorithms to correct label noise. The first, Self-Training Correction (STC), uses self-training to re-label noisy instances. The second, Cluster-based Correction (CC), is a clustering-based method that groups instances together to infer their ground-truth labels. We also adapt an algorithm from previous work, Polishing, a consensus-based method that consults an ensemble of classifiers to change the values of attributes and labels. We simplify Polishing so that it alters only the labels of instances, and call the result Polishing Labels (PL). We experimentally compare our novel methods with PL by examining their improvements in label quality, model quality, and AUC on binary and multi-class data sets under different noise levels. Our experimental results demonstrate that CC consistently and significantly improves label quality, model quality, and AUC. We further investigate how these three noise correction algorithms improve data quality, in terms of label accuracy, in the context of image labeling in crowdsourcing. First, we look at three consensus methods for inferring a ground-truth label from the multiple noisy labels obtained from crowdsourcing: Majority Voting (MV), Dawid-Skene (DS), and KOS. We then apply the three noise correction methods to correct the labels inferred by these consensus methods. Our experimental results show that the noise correction methods improve the labeling quality significantly. Overall, we conclude that CC performs the best. Our research illustrates the viability of noise correction as another line of defense against labeling error, especially in a crowdsourcing setting. Furthermore, it demonstrates the feasibility of automating the otherwise manual, expensive, and time-consuming process of analyzing a data set and correcting and cleaning its instances. © 2016 Elsevier Ltd. All rights reserved.
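As a rough illustration of the two-stage pipeline described above (consensus first, noise correction second), the following Python sketch infers one label per instance by majority voting and then relabels each instance with the majority label of its cluster. It is a deliberately simplified stand-in and not the CC algorithm of the paper: the function names `majority_vote` and `correct_by_cluster`, the toy votes, and the precomputed cluster assignments are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def correct_by_cluster(cluster_ids, labels):
    """Relabel every instance with the majority label of its cluster.

    Hypothetical simplification: cluster assignments are taken as given,
    whereas a real method would derive them from the instance features.
    """
    labels_per_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        labels_per_cluster.setdefault(cid, []).append(label)
    majority = {cid: majority_vote(ls) for cid, ls in labels_per_cluster.items()}
    return [majority[cid] for cid in cluster_ids]


# Toy crowdsourced votes: three workers label each of five images.
crowd_votes = [
    ["cat", "cat", "dog"],
    ["cat", "dog", "dog"],
    ["cat", "cat", "cat"],
    ["dog", "dog", "cat"],
    ["dog", "dog", "dog"],
]

# Stage 1: consensus by Majority Voting (MV).
inferred = [majority_vote(v) for v in crowd_votes]

# Stage 2: correction using assumed cluster assignments.
cluster_ids = [0, 0, 0, 1, 1]
corrected = correct_by_cluster(cluster_ids, inferred)

print(inferred)
print(corrected)
```

In this toy run, the second image receives the consensus label "dog" but sits in a cluster dominated by "cat", so the correction step flips it. That is the intuition behind using correction as an additional line of defense after consensus.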

Active label cleaning for improved dataset quality under resource constraints

OT Cleaner: Label Correction As Optimal Transport

CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties (Technical Report)

Label Smarter, Not Harder: CleverLabel for Faster Annotation of Ambiguous Image Classification with Higher Quality

Clean or Annotate: How to Spend a Limited Data Collection Budget

Active Label Refinement for Robust Training of Imbalanced Medical Image Classification Tasks in the Presence of High Label Noise

Improving Active Learning by Data Balance to Reduce Annotation Efforts

Learning Image Labels On-the-fly for Training Robust Classification Models

Deep Self-Cleansing for Medical Image Segmentation with Noisy Labels

An Efficient High-Quality Medical Lesion Image Data Labeling Method Based on Active Learning

An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets

Noisy Label Learning for Large-scale Medical Image Classification

ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

Active Label-Denoising Algorithm Based on Broad Learning for Annotation of Machine Health Status

Label Noise Correction and Application in Crowdsourcing

Label Critic: Design Data Before Models

Intrinsic Self-Supervision for Data Quality Audits

Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis

Active Learning with Label Quality Control

LABELNET: Recovering Noisy Labels