A Small-Sample Text Classification Model Based on Pseudo-Label Fusion Clustering Algorithm

Linda Yang, Baohua Huang, Shiqian Guo, Yunjie Lin, Tong Zhao
DOI: https://doi.org/10.3390/app13084716
2023-04-08
Applied Sciences
Abstract: Text classification is a mainstream research area in natural language processing, and improving classification performance when labeled samples are scarce is one of its most active problems. Current small-sample classification models can be trained with only a small number of labels, but their classification results remain unsatisfactory. To improve classification accuracy, we propose a Small-sample Text Classification model based on the Pseudo-label fusion Clustering algorithm (STCPC). The algorithm has two core components: (1) It mines the latent features of unlabeled data with a clustering-based pseudo-labeling training strategy, then reduces noise in the pseudo-labeled dataset through consistency training with augmented versions of its samples, which improves the quality of the pseudo-labeled dataset. (2) It augments the labeled data and then uses the Easy Plug-in Data Augmentation (EPiDA) framework to balance the diversity and quality of the augmented samples, enriching the labeled data in a controlled way. Comparison experiments against other classical algorithms show that the STCPC model effectively improves classification accuracy.
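The first core component can be illustrated with a toy sketch. This is not the STCPC implementation, which the abstract does not specify. It assumes bag-of-words vectors, uses a nearest-centroid assignment seeded by the labeled samples as a stand-in for the clustering step, and uses simple agreement between a sample and one word-deleted copy as a stand-in for consistency training. All function names (`bow`, `centroids`, `nearest`, `drop_word`, `pseudo_label`) are hypothetical.

```python
import math
import random
from collections import Counter


def bow(text):
    """Bag-of-words term counts for a whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroids(labeled):
    """Sum the term counts of the labeled texts of each class (assumed seeding)."""
    cents = {}
    for text, label in labeled:
        cents.setdefault(label, Counter()).update(bow(text))
    return cents


def nearest(text, cents):
    """Label of the centroid most similar to the text (ties: alphabetical)."""
    vec = bow(text)
    return max(sorted(cents), key=lambda lab: cosine(vec, cents[lab]))


def drop_word(text, rng):
    """Toy augmentation (an assumption): delete one randomly chosen word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def pseudo_label(labeled, unlabeled, seed=0):
    """Keep a pseudo-label only if the augmented copy receives the same label."""
    rng = random.Random(seed)
    cents = centroids(labeled)
    kept = []
    for text in unlabeled:
        label = nearest(text, cents)
        if nearest(drop_word(text, rng), cents) == label:
            kept.append((text, label))
    return kept


if __name__ == "__main__":
    labeled = [("the team won the match", "sports"),
               ("stocks fell on the market", "finance")]
    unlabeled = ["the match was won by the home team",
                 "market stocks fell again today"]
    for text, label in pseudo_label(labeled, unlabeled):
        print(label, "<-", text)
```

Filtering by agreement is a much cruder noise-reduction step than training with a consistency loss, but it shows the same idea: a pseudo-label that changes under a small perturbation of the input is treated as unreliable.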
materials science, multidisciplinary; engineering; chemistry; physics, applied