Abstract: Positive-unlabeled learning (PUL) aims at learning a binary classifier from only positive and unlabeled training data. Even though real-world applications often involve imbalanced datasets where the majority of examples belong to one class, most contemporary approaches to PUL do not investigate performance in this setting, which severely limits their applicability in practice. In this work, we therefore propose to tackle the issues of imbalanced datasets and model calibration in a PUL setting through an uncertainty-aware pseudo-labeling procedure (PUUPL): by boosting the signal from the minority class, pseudo-labeling expands the labeled dataset with new samples from the unlabeled set, while explicit uncertainty quantification prevents the emergence of harmful confirmation bias, leading to increased predictive performance. Within a series of experiments, PUUPL yields substantial performance gains in highly imbalanced settings while also showing strong performance in balanced PU scenarios across recent baselines. We furthermore provide ablations and sensitivity analyses to shed light on PUUPL's several ingredients. Finally, a real-world application with an imbalanced dataset confirms the advantage of our approach.

What problem does this paper attempt to address?

This paper addresses how to improve classifier performance in positive-unlabeled (PU) learning when the data are imbalanced, using uncertainty-aware pseudo-label selection (PUUPL). PU learning aims to learn a binary classifier from positive and unlabeled samples only, with no labeled negatives, a setting that arises in many practical applications and often involves imbalanced datasets. The paper points out that most contemporary PU methods do not evaluate performance under class imbalance, so research on this setting remains limited.

To address this, the authors propose PUUPL, which uses the uncertainty of a model ensemble to select confident unlabeled samples for pseudo-labeling, thereby strengthening the signal from the minority class. Selecting only low-uncertainty samples is intended to prevent confirmation bias, improve predictive performance, and keep the model calibrated.

In a series of experiments, PUUPL shows substantial gains on highly imbalanced datasets while remaining competitive in balanced PU scenarios. The paper also reports ablation studies and sensitivity analyses of PUUPL's components, and validates the approach on a real-world healthcare dataset.

The main contributions of the paper are:
1. Introducing a new framework, PUUPL, that addresses imbalanced data distributions in PU learning while remaining competitive on balanced datasets.
2. Evaluating PUUPL on multiple benchmarks and PU datasets, and reporting state-of-the-art results with self-training whether or not the prior probability of positive samples, π, is known.
3. Applying PUUPL to a real-world healthcare dataset, where it shows advantages over other PU learning methods and over previous state-of-the-art domain-specific methods.

Taken together, these results suggest that PUUPL is reliable, scalable, and applicable to a range of real-world scenarios.
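
The selection step described above, picking unlabeled samples on which an ensemble is confident, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the threshold `max_uncertainty`, and the use of mutual information (entropy of the mean prediction minus the mean entropy of the members) as the uncertainty score are assumptions made here for illustration only.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def select_pseudo_labels(ensemble_probs, max_uncertainty=0.05, max_new=None):
    """Pick unlabeled samples on which the ensemble members agree.

    ensemble_probs: array of shape (n_models, n_unlabeled) holding each
        member's predicted P(y = 1 | x). Hypothetical input format.
    max_uncertainty: assumed threshold on the disagreement score.
    max_new: optional cap on how many samples are pseudo-labeled per round.

    Returns (indices, pseudo_labels, uncertainty), with indices sorted from
    least to most uncertain.
    """
    probs = np.asarray(ensemble_probs, dtype=float)
    mean_p = probs.mean(axis=0)

    # Total uncertainty: entropy of the averaged prediction.
    total = binary_entropy(mean_p)
    # Expected data uncertainty: average entropy of the individual members.
    expected = binary_entropy(probs).mean(axis=0)
    # Their difference (mutual information) is large when members disagree.
    uncertainty = np.maximum(total - expected, 0.0)

    candidates = np.flatnonzero(uncertainty <= max_uncertainty)
    order = candidates[np.argsort(uncertainty[candidates])]
    if max_new is not None:
        order = order[:max_new]

    # Pseudo-label with the ensemble's majority belief.
    pseudo_labels = (mean_p[order] >= 0.5).astype(int)
    return order, pseudo_labels, uncertainty


if __name__ == "__main__":
    # Three ensemble members, three unlabeled samples (made-up numbers).
    demo = np.array([
        [0.95, 0.10, 0.05],
        [0.97, 0.90, 0.03],
        [0.96, 0.50, 0.04],
    ])
    idx, labels, unc = select_pseudo_labels(demo)
    print("selected indices:", idx)
    print("pseudo-labels:", labels)
    print("uncertainty:", np.round(unc, 4))
```

In this toy input the members agree on the first and third samples and disagree on the second, so only the first and third are selected. In a self-training loop, the selected samples would be added to the labeled set before the ensemble is retrained; how many to add per round and how to set the threshold are tuning choices not specified in the summary above.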

Uncertainty-aware Pseudo-label Selection for Positive-Unlabeled Learning

PSPU: Enhanced Positive and Unlabeled Learning by Leveraging Pseudo Supervision

Positive and Unlabeled Learning with Label Disambiguation

Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning

Positive and Unlabeled Learning through Negative Selection and Imbalance-aware Classification

Large-Margin Label-Calibrated Support Vector Machines for Positive and Unlabeled Learning

Positive-Unlabeled Learning with Non-Negative Risk Estimator

Positive Unlabeled Contrastive Learning

Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

Efficient Training for Positive Unlabeled Learning

In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning

Fairness-aware Model-agnostic Positive and Unlabeled Learning

Meta-learning for Positive-unlabeled Classification

Improving Positive Unlabeled Learning: Practical AUL Estimation and New Training Method for Extremely Imbalanced Data Sets

A boosting framework for positive-unlabeled learning

Loss Decomposition and Centroid Estimation for Positive and Unlabeled Learning

Robust Positive-Unlabeled Learning via Noise Negative Sample Self-correction

PUe: Biased Positive-Unlabeled Learning Enhancement by Causal Inference

Instance-Dependent PU Learning by Bayesian Optimal Relabeling

Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation

Augmented prediction of a true class for Positive Unlabeled data under selection bias