Uncertainty-aware Pseudo-label Selection for Positive-Unlabeled Learning

Emilio Dorigatti, Jann Goschenhofer, Benjamin Schubert, Mina Rezaei, Bernd Bischl
2024-03-10
Abstract: Positive-unlabeled learning (PUL) aims at learning a binary classifier from only positive and unlabeled training data. Even though real-world applications often involve imbalanced datasets where the majority of examples belong to one class, most contemporary approaches to PUL do not investigate performance in this setting, thus severely limiting their applicability in practice. In this work, we thus propose to tackle the issues of imbalanced datasets and model calibration in a PUL setting through an uncertainty-aware pseudo-labeling procedure (PUUPL): by boosting the signal from the minority class, pseudo-labeling expands the labeled dataset with new samples from the unlabeled set, while explicit uncertainty quantification prevents the emergence of harmful confirmation bias leading to increased predictive performance. Within a series of experiments, PUUPL yields substantial performance gains in highly imbalanced settings while also showing strong performance in balanced PU scenarios across recent baselines. We furthermore provide ablations and sensitivity analyses to shed light on PUUPL's several ingredients. Finally, a real-world application with an imbalanced dataset confirms the advantage of our approach.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve classifier performance in positive-unlabeled (PU) learning when the dataset is imbalanced, using uncertainty-aware pseudo-label selection (PUUPL). PU learning aims to train a binary classifier from positive and unlabeled samples only, with no labeled negatives, a setting that arises often in practice and frequently involves imbalanced data. The paper points out that although some methods attempt to address class imbalance in PU learning, most contemporary approaches do not evaluate performance in this setting, so research on it remains limited. To address this gap, the authors propose PUUPL, which uses the uncertainty of a model ensemble to select confidently predicted unlabeled samples for pseudo-labeling, thereby strengthening the signal from the minority class. Restricting pseudo-labels to low-uncertainty predictions guards against confirmation bias, improves predictive performance, and helps keep the model calibrated. In a series of experiments, PUUPL showed substantial performance gains on highly imbalanced datasets while also performing strongly in balanced PU scenarios. The paper also presents ablation studies and sensitivity analyses of PUUPL's components, and validates the approach on a real-world healthcare dataset.

Overall, the main contributions of the paper are:
1. Introducing PUUPL, a new framework that addresses imbalanced data distributions in PU learning while remaining competitive on balanced datasets.
2. Demonstrating the advantage of PUUPL on multiple benchmarks and PU datasets, achieving state-of-the-art results with self-training whether or not the prior probability of positive samples, π, is known.
3. Applying PUUPL to a real-world healthcare dataset, where it outperforms other PU learning methods and previous state-of-the-art domain-specific methods.
Taken together, these results suggest that the PUUPL framework is reliable, scalable, and applicable to a range of real-world PU scenarios, both balanced and imbalanced.
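The selection step described in the summary, choosing unlabeled samples on which a model ensemble is confident, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that uncertainty is measured as the disagreement between ensemble members (total predictive entropy minus the mean per-member entropy, one common estimate of epistemic uncertainty), and the names `select_pseudo_labels`, `max_uncertainty`, and `max_new` are hypothetical, introduced only for this example.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with success probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def select_pseudo_labels(ensemble_probs, max_uncertainty=0.05, max_new=None):
    """Pick unlabeled samples on which the ensemble members agree.

    ensemble_probs: array of shape (n_models, n_unlabeled) holding each
        member's predicted probability of the positive class.
    max_uncertainty: hypothetical threshold on the epistemic uncertainty.
    max_new: optional cap on how many samples are pseudo-labeled per round.

    Returns (indices, pseudo_labels, epistemic_uncertainty).
    """
    ensemble_probs = np.asarray(ensemble_probs, dtype=float)

    # Ensemble prediction: average of the members' probabilities.
    mean_prob = ensemble_probs.mean(axis=0)

    # Total uncertainty: entropy of the averaged prediction.
    total = binary_entropy(mean_prob)
    # Aleatoric part: average entropy of the individual members.
    aleatoric = binary_entropy(ensemble_probs).mean(axis=0)
    # Epistemic part: what remains, large when members disagree.
    epistemic = total - aleatoric

    # Keep only low-uncertainty samples, most certain first.
    candidates = np.flatnonzero(epistemic <= max_uncertainty)
    candidates = candidates[np.argsort(epistemic[candidates], kind="stable")]
    if max_new is not None:
        candidates = candidates[:max_new]

    # Pseudo-label by thresholding the ensemble prediction at 0.5.
    pseudo_labels = (mean_prob[candidates] >= 0.5).astype(int)
    return candidates, pseudo_labels, epistemic
```

In a self-training loop, the returned samples would be moved from the unlabeled set to the labeled set with their pseudo-labels before the ensemble is retrained. Samples on which the members disagree are left unlabeled, which is how uncertainty-based filtering limits confirmation bias in this sketch.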