Constructing Pseudo Documents With Semantic Similarity For Short Text Topic Discovery

Heng-Yang Lu, Yun Li, Chi Tang, Chong-Jun Wang, Junyuan Xie
DOI: https://doi.org/10.1007/978-3-030-04221-9_39
2018-01-01
Abstract: With the popularity of the Internet, short texts have become common in our daily life. Data such as tweets and online Q&A pairs are valuable in application domains such as question retrieval and personalized recommendation. However, the sparsity of short texts poses major challenges for learning topics with conventional topic models. Recently, models such as the Biterm Topic Model and the Word Network Topic Model have alleviated the sparsity problem by modeling topics on biterms or pseudo documents. These models rely on word co-occurrence to encourage words with higher semantic similarity to fall into the same topic. However, many semantically similar words rarely co-occur. To address this limitation, we propose a model named SEREIN, which exploits word embeddings to find more comprehensive semantic representations. Compared with existing models, SEREIN significantly improves the performance of topic discovery. Experiments on two open-source, real-world short text datasets also show the effectiveness of involving word embeddings.
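The idea in the abstract, that embedding similarity can link words which rarely co-occur, can be illustrated with a small sketch. This is not SEREIN's actual procedure, which the abstract does not specify. The toy 3-dimensional vectors, the `build_pseudo_documents` helper, and the 0.98 threshold are all assumptions made for illustration. Each word's pseudo document here is its co-occurring words plus any vocabulary word whose embedding is close enough in cosine similarity.

```python
import math

# Toy 3-dimensional embeddings: made-up values for illustration only,
# not taken from the paper or from any pretrained model.
EMBEDDINGS = {
    "car":        [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],
    "engine":     [0.70, 0.30, 0.10],
    "banana":     [0.00, 0.20, 0.95],
    "fruit":      [0.05, 0.25, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_pseudo_documents(short_texts, embeddings, threshold=0.98):
    """Hypothetical helper: map each word to a pseudo document.

    A word's pseudo document holds (1) words that co-occur with it in
    some short text and (2) words whose embedding similarity to it is
    at least `threshold`, even if the two never co-occur.
    """
    vocab = sorted({w for text in short_texts for w in text if w in embeddings})
    pseudo_docs = {}
    for word in vocab:
        doc = []
        # Part 1: co-occurrence, as in co-occurrence-based pseudo documents.
        for text in short_texts:
            if word in text:
                doc.extend(w for w in text if w != word)
        # Part 2: embedding neighbours that co-occurrence missed.
        for other in vocab:
            if other == word or other in doc:
                continue
            if cosine(embeddings[word], embeddings[other]) >= threshold:
                doc.append(other)
        pseudo_docs[word] = doc
    return pseudo_docs


# "car" and "automobile" never appear in the same short text.
short_texts = [["car", "engine"], ["automobile", "engine"], ["banana", "fruit"]]
pseudo_docs = build_pseudo_documents(short_texts, EMBEDDINGS)
print(pseudo_docs["car"])
```

In this toy corpus "car" and "automobile" never share a short text, so co-occurrence alone would not connect them directly; the embedding step adds each to the other's pseudo document while leaving unrelated words such as "banana" out. A topic model could then be trained on these pseudo documents instead of the original sparse texts.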