Prototype-Guided Pseudo Labeling for Semi-Supervised Text Classification

Weiyi Yang,Richong Zhang,Junfan Chen,Lihong Wang,Jaein Kim
2023-01-01
Abstract:Semi-supervised text classification (SSTC) aims at text classification with few labeled data and massive unlabeled data. Recent works achieve this task by pseudo-labeling methods, with the belief that the unlabeled and labeled data have identical data distribution, and assign the unlabeled data with pseudo-labels as additional supervision. However, existing pseudo-labeling methods usually suffer from ambiguous categorical boundary issues when training the pseudo-labeling phase, and simply select pseudo-labels without considering the unbalanced categorical distribution of the unlabeled data, making it difficult to generate reliable pseudo-labels for each category. We propose a novel semi-supervised framework, namely ProtoS2, with prototypical cluster separation (PCS) and prototypical-center data selection (CDS) technology to address the issue. Particularly, PCS exploits categorical prototypes to assimilate instance representations within the same category, thus emphasizing low-density separation for the pseudo-labeled data to alleviate ambiguous boundaries. Besides, CDS selects central pseudo-labeled data considering the categorical distribution, avoiding the model from biasing on dominant categories. Empirical studies and extensive analysis with four benchmarks demonstrate the effectiveness of the proposed model.
What problem does this paper attempt to address?