Incremental Self-training for Semi-supervised Learning

Jifeng Guo, Zhulin Liu, Tong Zhang, C. L. Philip Chen
2024-04-14
Abstract: Semi-supervised learning provides a solution to reduce the dependency of machine learning on labeled data. As one of the efficient semi-supervised techniques, self-training (ST) has received increasing attention. Several advancements have emerged to address challenges associated with noisy pseudo-labels. Previous works on self-training acknowledge the importance of unlabeled data but have not explored how to use it efficiently, nor have they addressed the high time cost of iterative learning. This paper proposes Incremental Self-training (IST) for semi-supervised learning to fill these gaps. Unlike ST, which processes all data indiscriminately, IST processes data in batches and prioritizes assigning pseudo-labels to unlabeled samples with high certainty. Then, once the model has stabilized, it processes the data around the decision boundary, enhancing classifier performance. IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbones, where it improves both recognition accuracy and learning speed. Notably, it outperforms state-of-the-art competitors on three challenging image classification tasks.
Machine Learning
What problem does this paper attempt to address?
The paper aims to address several key issues in self-training methods for semi-supervised learning:

1. **Pseudo-label noise problem**: Traditional self-training methods may generate incorrect pseudo-labels during the iterative process, leading to a decline in model performance.
2. **Insufficient utilization of unlabeled data**: Existing works, although emphasizing the importance of unlabeled data, fail to effectively utilize these data.
3. **High time consumption**: Multiple queries and clustering operations during the iterative learning process result in prolonged training time.

To tackle these issues, the paper proposes the Incremental Self-training (IST) method. IST improves traditional self-training methods in the following ways:

- **Batch processing of unlabeled data**: IST first clusters all unlabeled samples and prioritizes assigning pseudo-labels to easily classifiable samples based on the clustering results, thereby enhancing the early performance of the base classifier.
- **Introduction of a sequential query list**: By forming a query list based on sample certainty, IST reduces multiple clustering and query operations, thus accelerating the iterative learning process.
- **Utilization of samples near the decision boundary**: After the model stabilizes, IST focuses on handling samples near the decision boundary, further improving classifier performance.

Experimental results show that IST significantly improves recognition accuracy and reduces training time on multiple benchmark datasets, outperforming existing state-of-the-art methods.
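The certainty-ordered, batch-wise idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier and the distance-margin certainty score are assumptions standing in for the paper's backbone and clustering-based certainty, and the names `centroids`, `predict_with_margin`, `incremental_self_training`, and `batch_size` are hypothetical. The sketch ranks the unlabeled samples once into a query list, then pseudo-labels them batch by batch, most certain first, so samples near the decision boundary are handled last by an already refined model.

```python
import math


def centroids(points, labels):
    """Return the mean point of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_with_margin(cents, p):
    """Return (label, margin): nearest centroid and its gap to the runner-up.

    The margin is an assumed certainty score, not the paper's measure.
    """
    dists = sorted((math.dist(p, c), y) for y, c in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def incremental_self_training(x_lab, y_lab, x_unlab, batch_size=2):
    """Pseudo-label x_unlab in certainty order, refitting after each batch."""
    x_train, y_train = list(x_lab), list(y_lab)
    cents = centroids(x_train, y_train)

    # Build the query list once: indices sorted from most to least certain.
    order = sorted(
        range(len(x_unlab)),
        key=lambda i: -predict_with_margin(cents, x_unlab[i])[1],
    )

    pseudo = [None] * len(x_unlab)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # Label the current batch with the current model.
        for i in batch:
            pseudo[i], _ = predict_with_margin(cents, x_unlab[i])
        # Add the newly labeled samples and refit before the next batch.
        x_train.extend(x_unlab[i] for i in batch)
        y_train.extend(pseudo[i] for i in batch)
        cents = centroids(x_train, y_train)
    return pseudo, order


# Toy 2-D data: two labeled anchors and four unlabeled points on a line.
x_lab, y_lab = [(0, 0), (10, 0)], [0, 1]
x_unlab = [(4, 0), (1, 0), (6, 0), (9, 0)]
pseudo, order = incremental_self_training(x_lab, y_lab, x_unlab)
print(order)   # [1, 3, 0, 2]: the points at x=1 and x=9 are labeled first
print(pseudo)  # [0, 0, 1, 1]
```

Because the query list is computed once rather than re-ranked every iteration, the loop avoids repeated scoring passes over the unlabeled pool, which mirrors the time-saving motivation in the summary. A real setup would replace the centroid model with the chosen backbone and would likely filter or re-check low-certainty pseudo-labels.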