Topic Attention Encoder: A Self-Supervised Approach for Short Text Clustering.

Jian Jin, Haiyuan Zhao, Ping Ji
DOI: https://doi.org/10.1177/0165551520977453
2020-01-01
Journal of Information Science
Abstract: Short text clustering is a challenging and important task in many practical applications. However, many bag-of-words–based methods for short text clustering are limited by the sparsity of the text representation, while many sentence embedding–based methods fail to capture the document structure dependencies within a text corpus. In consideration of these shortcomings, a topic attention encoder (TAE) is proposed in this study. Topics derived from the corpus by topic modelling introduce cross-document information. The encoder takes the document-topic vector as the learning target and the concatenation of each word embedding with its corresponding topic-word vector as the input. A self-attention mechanism is also employed in the encoder, which extracts the weights of the hidden states adaptively and encodes the semantics of each short text document. By capturing both global dependencies and local semantics, TAE combines the strengths of bag-of-words methods and sentence embedding methods. Finally, several categories of benchmarking experiments were conducted on three public data sets. The results demonstrate that the proposed TAE outperforms many document representation benchmark methods for short text clustering.
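The data flow described in the abstract can be illustrated with a minimal forward-pass sketch in plain Python. It is not the authors' implementation. The abstract does not specify the encoder that produces the hidden states, the form of the attention scoring, or the loss, so the per-token tanh projection, the single attention vector `v_att`, the softmax output layer and the cross-entropy loss are all assumptions. The names `tae_forward`, `W_h`, `v_att` and `W_out` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights, used here only to make the sketch runnable."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def tae_forward(word_embs, topic_word_vecs, params):
    # Input: concatenate each word embedding with its topic-word vector.
    inputs = [e + t for e, t in zip(word_embs, topic_word_vecs)]
    # Hidden states: per-token tanh projection (assumed stand-in for the encoder).
    hidden = [[math.tanh(z) for z in matvec(params["W_h"], x)] for x in inputs]
    # Self-attention: one scalar score per hidden state, normalised by softmax.
    scores = [sum(a * h for a, h in zip(params["v_att"], h_t)) for h_t in hidden]
    weights = softmax(scores)
    # Document representation: attention-weighted sum of the hidden states.
    dim = len(hidden[0])
    doc_vec = [sum(w * h_t[i] for w, h_t in zip(weights, hidden)) for i in range(dim)]
    # Prediction of the document-topic vector (assumed softmax output layer).
    pred = softmax(matvec(params["W_out"], doc_vec))
    return doc_vec, weights, pred


def cross_entropy(target, pred):
    """Assumed training loss between target and predicted topic distributions."""
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))


# Toy example: 5 tokens, 8-dim word embeddings, 4 topics, 6-dim hidden states.
rng = random.Random(0)
n_tokens, emb_dim, n_topics, hid_dim = 5, 8, 4, 6
params = {
    "W_h": random_matrix(hid_dim, emb_dim + n_topics, rng),
    "v_att": [rng.uniform(-0.1, 0.1) for _ in range(hid_dim)],
    "W_out": random_matrix(n_topics, hid_dim, rng),
}
word_embs = random_matrix(n_tokens, emb_dim, rng)
topic_word_vecs = random_matrix(n_tokens, n_topics, rng)
target = [0.7, 0.1, 0.1, 0.1]  # made-up document-topic vector from a topic model

doc_vec, weights, pred = tae_forward(word_embs, topic_word_vecs, params)
loss = cross_entropy(target, pred)
```

In this sketch, `doc_vec` plays the role of the document representation that would be passed to a clustering algorithm after training, and the topic model only supplies the inputs `topic_word_vecs` and the training target `target`.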