Abstract: Mislabeled examples are ubiquitous in real-world machine learning datasets, advocating the development of techniques for automatic detection. We show that most mislabeled detection methods can be viewed as probing trained machine learning models using a few core principles. We formalize a modular framework that encompasses these methods, parameterized by only 4 building blocks, as well as a Python library that demonstrates that these principles can actually be implemented. The focus is on classifier-agnostic concepts, with an emphasis on adapting methods developed for deep learning models to non-deep classifiers for tabular data. We benchmark existing methods on (artificial) Completely At Random (NCAR) as well as (realistic) Not At Random (NNAR) labeling noise from a variety of tasks with imperfect labeling rules. This benchmark provides new insights as well as limitations of existing methods in this setup.

What problem does this paper attempt to address?

Mislabeled samples are widespread in real machine-learning datasets. The paper asks how to detect them automatically in supervised-learning datasets, in order to improve both the quality of the dataset and the performance of the models trained on it. The authors propose a new perspective: existing mislabeled-detection methods are regarded as probes of trained machine-learning models, and a modular framework with four core components is constructed to describe them. In addition, they develop a Python library that implements this framework and evaluate different detection methods under different types of noise through a large number of experiments.

### The main contributions of the paper include:

1. **Concept and Framework**:
   - Propose a new perspective that regards most mislabeled-detection methods as probes of trained machine-learning models.
   - Construct a modular framework that uses four core components to describe these methods, and show how the existing methods fit into it.
2. **Investigation and Review**:
   - Review a large number of existing mislabeled-detection methods, covering deep-learning models and other classical machine-learning algorithms.
   - Emphasize three common strategies for dealing with specific problems in weakly supervised learning.
3. **Implementation and Tools**:
   - Develop a Python library that allows for the rapid instantiation of a large number of existing mislabeled-detection methods and the design and experimentation of new methods.
   - The library not only implements the core framework but also provides functions for loading existing weakly supervised datasets, to improve the reproducibility of experiments.
4. **Empirical Evaluation**:
   - Conduct large-scale experiments on multiple text and tabular datasets to evaluate the performance of a series of detectors under different settings.
   - Vary the noise types, hyperparameter selection methods, and strategies for dealing with the detected mislabeled samples, providing new insights into the behavior of different methods.

### The core content of the paper:

1. **Definition and Problem Statement**:
   - Define what mislabeled samples are and discuss their ambiguity in the statistical-learning framework.
   - Distinguish between the deterministic and the stochastic case, in which the true concept is a function and a probability distribution respectively.
2. **Trust-Scoring Method**:
   - Propose the concept of trust scoring, that is, a proxy indicator that estimates conditional probabilities and thereby separates correctly labeled samples from mislabeled ones.
3. **Assumptions of the Noise-Generation Process**:
   - Discuss detector designs based on the structure of the noise-generation process, such as the noise-transition matrix, and point out the limitations of these methods.
4. **Data-Region Classification**:
   - Classify the data regions into four types according to the availability of data, and discuss the challenges of distinguishing rare useful samples from mislabeled samples in each case.
5. **Application Scenarios**:
   - Explore the applications of mislabeled-sample detection in practical scenarios such as weakly supervised learning, crowdsourced annotation, and web crawling.
6. **Fully Automatic Learning Strategy**:
   - Describe fully automatic learning in the presence of mislabeled samples through detection and handling strategies, including filtering, semi-supervised learning, and biquality learning.

In general, this paper aims to provide a comprehensive perspective and practical tools to help researchers and practitioners better understand and deal with the mislabeled-sample problem in machine-learning datasets.
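
The trust-scoring idea summarized above can be illustrated with a minimal sketch. It is not the paper's library or its four-block framework: the function names `flip_labels_ncar` and `knn_trust_scores`, the synthetic two-blob dataset, the choice of a k-nearest-neighbor model as the probed classifier, and the 0.5 filtering threshold are all assumptions made for illustration. The sketch injects NCAR noise (every label flipped with the same probability, independently of the features), scores each sample by a leave-one-out estimate of the probability of its observed label, and filters the low-trust samples.

```python
import numpy as np


def flip_labels_ncar(y, noise_rate, n_classes, rng):
    """Hypothetical helper: inject NCAR noise into integer labels.

    Each label is flipped with the same probability, independently of the
    features and of the true class, to a uniformly chosen *different* class.
    """
    y_noisy = y.copy()
    flipped = rng.random(len(y)) < noise_rate
    # An offset in [1, n_classes - 1] modulo n_classes never maps a class to itself.
    offsets = rng.integers(1, n_classes, size=int(flipped.sum()))
    y_noisy[flipped] = (y[flipped] + offsets) % n_classes
    return y_noisy, flipped


def knn_trust_scores(X, y, k=10):
    """Hypothetical helper: leave-one-out k-NN estimate of P(observed label | x).

    The score is the fraction of a sample's k nearest neighbors that carry
    the same (possibly noisy) label as the sample itself.
    """
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a sample must not vote for itself
    neighbors = np.argpartition(d2, k, axis=1)[:, :k]  # indices of the k closest
    return (y[neighbors] == y[:, None]).mean(axis=1)


# Synthetic two-class dataset: two well-separated Gaussian blobs (an assumption).
rng = np.random.default_rng(0)
n_per_class = 200
X = np.vstack([
    rng.normal(-3.0, 1.0, size=(n_per_class, 2)),
    rng.normal(3.0, 1.0, size=(n_per_class, 2)),
])
y_true = np.repeat([0, 1], n_per_class)

y_noisy, is_mislabeled = flip_labels_ncar(y_true, noise_rate=0.1, n_classes=2, rng=rng)
trust = knn_trust_scores(X, y_noisy, k=10)

# Filtering strategy: drop samples whose observed label looks unlikely.
suspects = trust < 0.5
X_kept, y_kept = X[~suspects], y_noisy[~suspects]

recall = (suspects & is_mislabeled).sum() / is_mislabeled.sum()
print(f"flagged {suspects.sum()} samples, recall on injected noise: {recall:.2f}")
```

On data this cleanly separated, almost any probe recovers the injected noise. The realistic NNAR setting described in the abstract is harder, because the label errors depend on the features and can look locally consistent to the probed model.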

Mislabeled examples detection viewed as probing machine learning models: concepts, survey and extensive benchmark
