Topic Model over Short Texts Incorporating Word Embedding

Kai Yu, Yiming Zhang, Xu Wang
DOI: https://doi.org/10.2991/aeecs-18.2018.34
2018-01-01
Abstract: The data sparsity of short texts makes it difficult to capture their document-level word co-occurrence patterns, which is why conventional topic models such as LDA suffer a large performance degradation on short texts. Word embeddings, a by-product of learning neural probabilistic language models, express semantic similarity between words well. In this paper, we propose a new model called promotion-BTM, which promotes the probability that words judged similar by their word embeddings belong to the same topic. It also divides the words of a biterm into topical words and general words, and promotes only the semantically similar words of topical words. Extensive experiments on real-world datasets show that our model outperforms the baseline model BTM on all evaluations.
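The two ingredients named in the abstract, biterms and embedding-based word similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the 0.8 cosine threshold are all assumptions made for illustration, and the abstract does not specify how similarity is measured or how the promotion step uses it.

```python
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_words(word, embeddings, threshold=0.8):
    """Words whose embedding is close to that of `word`.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    target = embeddings[word]
    return sorted(
        other
        for other, vec in embeddings.items()
        if other != word and cosine(target, vec) >= threshold
    )


# Toy embeddings, invented purely for this example.
toy_embeddings = {
    "apple": [1.0, 0.0],
    "pear": [0.9, 0.1],
    "car": [0.0, 1.0],
}

biterms = extract_biterms(["apple", "pear", "car"])
neighbours = similar_words("apple", toy_embeddings)
```

In a model of the kind the abstract describes, a candidate set like `neighbours` would be the words whose topic probability is promoted when "apple" is treated as a topical word; how that promotion is weighted is left to the paper itself.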