Abstract: Pseudo-labeling is a popular semi-supervised learning technique to leverage unlabeled data when labeled samples are scarce. The generation and selection of pseudo-labels heavily rely on labeled data. Existing approaches implicitly assume that the labeled data is gold standard and 'perfect'. However, this can be violated in reality with issues such as mislabeling or ambiguity. We address this overlooked aspect and show the importance of investigating labeled data quality to improve any pseudo-labeling method. Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling. We select useful labeled and pseudo-labeled samples via analysis of learning dynamics. We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world tabular and image datasets. Additionally, DIPS improves data efficiency and reduces the performance distinctions between different pseudo-labelers. Overall, we highlight the significant benefits of a data-centric rethinking of pseudo-labeling in real-world settings.

What problem does this paper attempt to address?

The paper addresses a key issue in pseudo-labeling for semi-supervised learning: how to keep pseudo-labeling effective when the quality of the labeled data is poor. It covers the following points:

1. **Quality issues of labeled data**: Existing pseudo-labeling methods typically assume that labeled data is the "gold standard," but real-world data often has quality issues such as mislabeling or ambiguity. These issues can propagate errors through the pseudo-labeling process and degrade model performance.
2. **Improving pseudo-labeling methods**: To address this, the authors propose a framework called DIPS (Data-centric Insights for Pseudo-labeling with Selection). It extends existing pseudo-labeling methods by analyzing learning dynamics to select the most useful labeled and pseudo-labeled samples.
3. **Experimental validation**: The paper evaluates DIPS through a series of experiments:
   - On synthetic datasets, DIPS significantly improves test accuracy when the labeled data is noisy.
   - On multiple real-world datasets (both tabular and image data), DIPS improves the performance of different pseudo-labeling baselines and reduces the performance differences between them.
   - With less labeled data, methods combined with DIPS reach performance levels similar to those of the traditional methods.
   - DIPS still improves performance when the labeled and unlabeled data come from different countries.
   - The paper also explores the potential application of DIPS to image data.

In summary, the DIPS framework attends not only to the selection of unlabeled data but also to the quality of the labeled data. Experiments demonstrate the effectiveness of this approach, especially when the labeled data is noisy. This provides new insights for improving existing pseudo-labeling techniques and enhancing the performance of semi-supervised learning methods.
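The idea of selecting samples "via analysis of learning dynamics" can be illustrated with a small sketch. The summary above does not state the exact selection rule, so everything specific here is an assumption made for illustration: the function names `learning_dynamics_scores` and `select_useful`, the two scores (mean probability assigned to a sample's given label across training checkpoints, and the mean of p(1 - p) as a variability measure), and the threshold values. It is not the authors' implementation.

```python
import numpy as np


def learning_dynamics_scores(probs, labels):
    """Score each sample from its prediction history.

    probs:  array of shape (n_checkpoints, n_samples, n_classes) holding the
            model's predicted class probabilities at each saved checkpoint.
    labels: array of shape (n_samples,) with the given (or pseudo) label.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Probability assigned to each sample's given label at every checkpoint;
    # the result has shape (n_checkpoints, n_samples).
    p_label = probs[:, np.arange(labels.shape[0]), labels]
    confidence = p_label.mean(axis=0)                       # high = consistently agrees with label
    uncertainty = (p_label * (1.0 - p_label)).mean(axis=0)  # high = ambiguous; at most 0.25
    return confidence, uncertainty


def select_useful(probs, labels, conf_threshold=0.8, unc_threshold=0.2):
    """Boolean mask of samples to keep (both thresholds are illustrative assumptions)."""
    confidence, uncertainty = learning_dynamics_scores(probs, labels)
    return (confidence >= conf_threshold) & (uncertainty <= unc_threshold)


# Toy history: 3 checkpoints, 3 samples, 2 classes.
probs = np.array([
    [[0.90, 0.10], [0.80, 0.20], [0.50, 0.50]],
    [[0.95, 0.05], [0.90, 0.10], [0.60, 0.40]],
    [[0.99, 0.01], [0.95, 0.05], [0.40, 0.60]],
])
labels = np.array([0, 1, 0])  # sample 1 looks mislabeled, sample 2 ambiguous
mask = select_useful(probs, labels)
print(mask.tolist())
```

Under these assumptions, the same mask could be computed for both the labeled set and the newly pseudo-labeled samples at each pseudo-labeling iteration, so that only samples the model learns consistently are used for training.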

You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling

Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation

Debiased Pseudo Labeling in Self-Training

A Review of Pseudo-Labeling for Computer Vision

In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning

Deep Insights into Noisy Pseudo Labeling on Graph Data

Pseudo Labeling Methods for Semi-Supervised Semantic Segmentation: A Review and Future Perspectives

Using Unreliable Pseudo-Labels for Label-Efficient Semantic Segmentation

Why the pseudo label based semi-supervised learning algorithm is effective?

Pseudo-labeling for Scalable 3D Object Detection

Rethinking Pseudo Labels for Semi-Supervised Object Detection

Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

Pseudo Label Selection is a Decision Problem

Exploiting Unlabeled Data via Partial Label Assignment for Multi-Class Semi-Supervised Learning

Pseudo-Labeling Enhanced by Privileged Information and Its Application to In Situ Sequencing Images

Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection

The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards

A Label is Worth a Thousand Images in Dataset Distillation

How many labelers do you have? A closer look at gold-standard labels

Label Smarter, Not Harder: CleverLabel for Faster Annotation of Ambiguous Image Classification with Higher Quality