Self-training Method Based on GCN for Semi-Supervised Short Text Classification

Hongyan Cui, Gangkun Wang, Yuanxin Li, Roy E. Welsch
DOI: https://doi.org/10.1016/j.ins.2022.07.186
IF: 8.1
2022-01-01
Information Sciences
Abstract: Semi-supervised short text classification is challenging because short texts are sparse and labeled data is limited. To compensate for the lack of labeled data, many models generate additional text samples, which is cumbersome and scales poorly. To overcome this deficiency, we propose a Self-Training Text method based on Graph Convolutional Networks (ST-Text-GCN). Unlike previous approaches, our self-training method is convenient: labeled information is propagated to target samples along the structure of the manifold rather than by introducing extra knowledge. Specifically, instead of adding text training samples, our method adds keywords to the training set. The model calculates a confidence for each word, which reflects the word's degree of ambiguity. Words with high confidence are automatically marked as pseudo-labeled data. Word confidence is also incorporated into the calculation of the graph's edge weights to reduce classification errors caused by word ambiguity. In this way, our method makes full use of the keywords in short texts when labeled data is scarce. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art models on multiple benchmark datasets.
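The abstract does not define word confidence or say exactly how it enters the edge weights, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' formulation. It assumes confidence is one minus the normalized entropy of a word's class distribution over the labeled texts, that a word is pseudo-labeled with its majority class once its confidence passes a threshold, and that a base edge weight (for example a TF-IDF score) is simply multiplied by the confidence. The function names and the threshold value are hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_confidence(labeled_docs, num_classes):
    """Return {word: (confidence, majority_label)} from (tokens, label) pairs.

    Assumption: confidence = 1 - normalized entropy of the word's class
    distribution, so a word seen in only one class scores 1.0 and a word
    spread evenly over all classes scores 0.0. Requires num_classes >= 2.
    """
    class_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        for word in set(tokens):  # count each word once per document
            class_counts[word][label] += 1

    scores = {}
    for word, counts in class_counts.items():
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
        confidence = 1.0 - entropy / math.log(num_classes)
        majority_label = counts.most_common(1)[0][0]
        scores[word] = (confidence, majority_label)
    return scores


def pseudo_label_words(scores, threshold=0.9):
    """Keep only words whose confidence reaches the (hypothetical) threshold."""
    return {w: label for w, (conf, label) in scores.items() if conf >= threshold}


def scaled_edge_weight(base_weight, confidence):
    """Assumption: down-weight edges that touch ambiguous words."""
    return base_weight * confidence


# Toy labeled set: class 0 = sports, class 1 = finance.
docs = [
    (["goal", "match"], 0),
    (["goal", "team"], 0),
    (["stock", "match"], 1),
]
scores = word_confidence(docs, num_classes=2)
keywords = pseudo_label_words(scores)
```

In this toy set, "match" appears in both classes and is therefore treated as ambiguous, while words that occur in a single class become pseudo-labeled keywords. A real implementation would likely also require a minimum word frequency before trusting the confidence estimate.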