AuxMix: Semi-Supervised Learning with Unconstrained Unlabeled Data

Amin Banitalebi-Dehkordi, Pratik Gujjar, Yong Zhang
DOI: https://doi.org/10.48550/arXiv.2206.06959
2022-06-15
Abstract: Semi-supervised learning (SSL) has seen great strides when labeled data is scarce but unlabeled data is abundant. Critically, most recent work assumes that such unlabeled data is drawn from the same distribution as the labeled data. In this work, we show that state-of-the-art SSL algorithms suffer a degradation in performance in the presence of unlabeled auxiliary data that does not necessarily possess the same class distribution as the labeled set. We term this problem Auxiliary-SSL and propose AuxMix, an algorithm that leverages self-supervised learning tasks to learn generic features in order to mask auxiliary data that are not semantically similar to the labeled set. We also propose to regularize learning by maximizing the predicted entropy for dissimilar auxiliary samples. We show an improvement of 5% over existing baselines on a ResNet-50 model when trained on the CIFAR10 dataset with 4k labeled samples and with all unlabeled data drawn from the Tiny-ImageNet dataset. We report competitive results on several datasets and conduct ablation studies.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the performance degradation of existing semi-supervised learning (SSL) algorithms when the unlabeled data is unconstrained (i.e., auxiliary data). Most existing SSL methods assume that the unlabeled data comes from the same distribution as the labeled data. In practical applications this often does not hold: the unlabeled data may come from a different distribution, which can significantly reduce model performance. For example, when the unlabeled data is drawn from a different dataset than the labeled data, classification accuracy may drop considerably even though the labeled set is unchanged. The authors formulate this setting as a new problem, Auxiliary-SSL, and propose an algorithm for it, AuxMix. The algorithm learns generic features through self-supervised learning tasks and uses them to mask auxiliary samples that are not semantically similar to the labeled set. It also regularizes learning by maximizing the entropy of the predictions for dissimilar auxiliary samples. Experimental results show that AuxMix improves model performance under the class distribution mismatch introduced by auxiliary data, especially when the unlabeled data comes from a different dataset.
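The two ideas described above, masking dissimilar auxiliary samples and maximizing prediction entropy on them, can be illustrated with a small sketch. This is not the authors' implementation: the cosine-similarity test against class prototypes, the threshold of 0.5, and the function names `similarity_mask` and `entropy_regularizer` are all assumptions chosen for illustration, and the feature vectors stand in for whatever a self-supervised encoder would produce.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_mask(aux_features, labeled_prototypes, threshold=0.5):
    """Hypothetical masking rule: 1 if an auxiliary sample is close to at
    least one labeled-class prototype, else 0. The threshold is an assumption."""
    mask = []
    for feature in aux_features:
        best = max(cosine(feature, proto) for proto in labeled_prototypes)
        mask.append(1 if best >= threshold else 0)
    return mask

def entropy_regularizer(aux_logits, mask):
    """Negative mean prediction entropy over masked-out (dissimilar) samples.
    Minimizing this term pushes those predictions toward uniform."""
    dissimilar = [entropy(softmax(logits))
                  for logits, keep in zip(aux_logits, mask) if keep == 0]
    if not dissimilar:
        return 0.0
    return -sum(dissimilar) / len(dissimilar)

# Toy example: two class prototypes and two auxiliary samples.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
aux_features = [[0.9, 0.1], [-1.0, -1.0]]
aux_logits = [[2.0, 0.0], [0.0, 0.0]]

mask = similarity_mask(aux_features, prototypes)
reg = entropy_regularizer(aux_logits, mask)
```

In this toy setup the first auxiliary sample lies close to a prototype and is kept, while the second is masked out and contributes only to the entropy term. In a real training loop the regularizer would be added to the SSL loss with a weighting coefficient, which is likewise an assumption here.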