You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling

Nabeel Seedat, Nicolas Huynh, Fergus Imrie, Mihaela van der Schaar
2024-06-20
Abstract: Pseudo-labeling is a popular semi-supervised learning technique to leverage unlabeled data when labeled samples are scarce. The generation and selection of pseudo-labels heavily rely on labeled data. Existing approaches implicitly assume that the labeled data is gold standard and 'perfect'. However, this can be violated in reality with issues such as mislabeling or ambiguity. We address this overlooked aspect and show the importance of investigating labeled data quality to improve any pseudo-labeling method. Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling. We select useful labeled and pseudo-labeled samples via analysis of learning dynamics. We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world tabular and image datasets. Additionally, DIPS improves data efficiency and reduces the performance distinctions between different pseudo-labelers. Overall, we highlight the significant benefits of a data-centric rethinking of pseudo-labeling in real-world settings.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a key weakness of pseudo-labeling in semi-supervised learning: how to keep pseudo-labeling effective when the labeled data itself is of poor quality. It covers the following points:

1. **Quality issues of labeled data**: Existing pseudo-labeling methods typically assume that labeled data is the "gold standard," but real-world data often has quality issues such as mislabeling or ambiguity. These issues can propagate errors through the pseudo-labeling process and degrade model performance.
2. **Improving pseudo-labeling methods**: To address this, the authors propose a framework called DIPS (Data-centric Insights for Pseudo-labeling with Selection), which extends existing pseudo-labeling methods by analyzing learning dynamics to select the most useful labeled and pseudo-labeled samples.
3. **Experimental validation**: The paper demonstrates the effectiveness and practicality of DIPS through a series of experiments:
   - Demonstrating on synthetic datasets that DIPS can significantly improve test accuracy when labeled data is noisy.
   - Confirming on multiple real-world datasets (tabular and image data) that DIPS can improve the performance of different pseudo-labeling baselines and reduce the performance differences between them.
   - Showing that, with DIPS, less labeled data can achieve performance similar to that of the baseline methods.
   - Verifying that DIPS can still improve performance when labeled and unlabeled data come from different countries.
   - Exploring the potential application of DIPS to image data.

In summary, the DIPS framework attends not only to the selection of unlabeled data but also to the quality of the labeled data. The experiments support the effectiveness of this approach, especially when the labeled data is noisy.
This provides new insights for improving existing pseudo-labeling techniques and enhancing the performance of semi-supervised learning methods.
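
The selection idea summarized above can be illustrated with a minimal sketch. The page does not specify how DIPS scores samples, so everything here is an assumption made for illustration: the function names, the two scores (mean confidence in the assigned label across training checkpoints, and the mean of p(1 - p) as an uncertainty proxy), and the threshold values are hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def learning_dynamics_scores(probs_over_epochs, labels):
    """Score each sample from its predicted probabilities across checkpoints.

    probs_over_epochs: array of shape (epochs, samples, classes).
    labels: assigned (given or pseudo) label per sample, shape (samples,).
    Both scores are illustrative assumptions, not taken from the source.
    """
    probs = np.asarray(probs_over_epochs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_samples = probs.shape[1]
    # Probability given to the assigned label at every checkpoint: (epochs, samples)
    p_label = probs[:, np.arange(n_samples), labels]
    confidence = p_label.mean(axis=0)
    uncertainty = (p_label * (1.0 - p_label)).mean(axis=0)
    return confidence, uncertainty


def select_useful(probs_over_epochs, labels, conf_threshold=0.8, unc_threshold=0.2):
    """Boolean mask of samples kept for training (hypothetical thresholds)."""
    confidence, uncertainty = learning_dynamics_scores(probs_over_epochs, labels)
    return (confidence >= conf_threshold) & (uncertainty <= unc_threshold)


# Toy example: 2 checkpoints, 3 samples, 2 classes.
example_probs = np.array([
    [[0.90, 0.10], [0.50, 0.50], [0.10, 0.90]],
    [[0.95, 0.05], [0.40, 0.60], [0.05, 0.95]],
])
example_labels = np.array([0, 1, 0])  # the third label disagrees with the model
example_mask = select_useful(example_probs, example_labels)
print(example_mask)
```

In this toy case, the first sample is kept, the second is dropped as ambiguous, and the third is dropped because the model consistently contradicts its assigned label, which is the kind of possibly mislabeled point a data-centric filter aims to catch. The same mask could be applied to both the labeled set and newly pseudo-labeled samples before each retraining round.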