Unleash the Power of Inconsistency-Based Semi-Supervised Active Learning by Dynamic Programming of Curriculum Learning
Jiannan Guo, Yangyang Kang, Xiaolin Li, Wenqiao Zhang, Kun Kuang, Changlong Sun, Siliang Tang, Fei Wu
DOI: https://doi.org/10.1109/tkde.2024.3417235
2024-01-01
Abstract: In the training procedures of many real-world learning models, gathering and annotating sufficient labeled data can be cost-prohibitive. To mitigate this data-hungry problem, active learning (AL) and semi-supervised learning (SSL) are frequently adopted as two effective but often isolated means. Some recent studies have explored combining AL and SSL to better probe the unlabeled data. However, almost all of these contemporary SSL-AL works use a simple combination strategy that ignores the inherent relation between SSL and AL. Further, other methods suffer from high computational costs when dealing with large-scale, high-dimensional datasets. Motivated by the industry practice of labeling data, we first propose an innovative Inconsistency-based virtual aDvErsarial Active Learning (IDEAL) algorithm to further investigate SSL-AL's potential superiority and achieve mutual enhancement of AL and SSL: SSL propagates label information to unlabeled samples and provides smoothed embeddings for AL, while AL excludes samples with inconsistent predictions and considerable uncertainty for SSL. We estimate the inconsistency of unlabeled samples with augmentation strategies of different granularities, including fine-grained continuous perturbation exploration and coarse-grained data transformations. Moreover, because unlabeled samples are still used inefficiently during semi-supervised training, we extend IDEAL to a curriculum-guided version, namely the SPL-IDEAL algorithm. SPL-IDEAL can regularize the training process towards better regions in parameter space and denoise low-confidence pseudo labels, achieving better performance. Extensive experiments on both text and image benchmark datasets validate the effectiveness of the proposed IDEAL and SPL-IDEAL algorithms against state-of-the-art baselines.
Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithms.
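
The inconsistency-based selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that inconsistency is measured as the mean KL divergence between a model's prediction on an unlabeled sample and its predictions on augmented views of that sample, and that the most inconsistent samples are sent for annotation. The function names, the choice of KL divergence, the averaging over views, and the `budget` parameter are all assumptions made for illustration.

```python
import numpy as np


def inconsistency_scores(p_orig, p_aug, eps=1e-12):
    """Hypothetical inconsistency measure (an assumption, not the paper's).

    p_orig: array of shape (N, C), class probabilities on the original samples.
    p_aug:  array of shape (K, N, C), class probabilities on K augmented views.
    Returns an array of shape (N,): mean KL(p_orig || p_aug_k) over the K views.
    """
    p = np.clip(np.asarray(p_orig, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_aug, dtype=float), eps, 1.0)
    # Broadcast the original predictions against every augmented view.
    kl = np.sum(p[None] * (np.log(p[None]) - np.log(q)), axis=-1)  # shape (K, N)
    return kl.mean(axis=0)


def select_for_labeling(p_orig, p_aug, budget):
    """Return indices of the `budget` most inconsistent unlabeled samples."""
    scores = inconsistency_scores(p_orig, p_aug)
    return np.argsort(-scores)[:budget]
```

In this sketch, samples whose predictions change the most under augmentation are routed to annotators, while the remaining, more consistent samples would stay in the unlabeled pool used for semi-supervised training.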