Abstract: Many applications require semantic understanding of short texts, and inferring discriminative and coherent latent topics is a critical and fundamental task in these applications. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is so short, word co-occurrences in short texts are much sparser. Recent studies show that the Dirichlet Multinomial Mixture (DMM) model is effective for topic inference over short texts by assuming that each piece of short text is generated by a single topic. However, DMM has two main limitations. First, even though it seems reasonable to assume that each short text has only one topic because of its shortness, the definition of “shortness” is subjective and the length of the short texts is dataset dependent. That is, the single-topic assumption may be too strong for some datasets. To address this limitation, we propose to model the number of topics per short text with a Poisson distribution, allowing each short text to be associated with a small number of topics (e.g., one to three topics). This model is named PDMM. Second, DMM (and also PDMM) does not have access to background knowledge (e.g., semantic relations between words) when modeling short texts. When a human being interprets a piece of short text, the understanding is based not solely on its content words, but also on their semantic relations. Recent advances in word embeddings offer effective learning of word semantic relations from a large corpus. Such auxiliary word embeddings enable us to address the second limitation. To this end, we propose to promote semantically related words under the same topic during the sampling process, by using the generalized Pólya urn (GPU) model. Through the GPU model, background knowledge about word semantic relations learned from millions of external documents can be easily exploited to improve topic modeling for short texts.
By directly extending the PDMM model with the GPU model, we propose two more effective topic models for short texts, named GPU-DMM and GPU-PDMM. Through extensive experiments on two real-world short text collections in two languages, we demonstrate that PDMM achieves better topic representations than state-of-the-art models, measured by topic coherence. The learned topic representation leads to better accuracy in a text classification task, as an indirect evaluation. Both GPU-DMM and GPU-PDMM further improve topic coherence and text classification accuracy. GPU-PDMM outperforms GPU-DMM at the price of higher computational costs.
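
The promotion step described above can be illustrated with a small sketch. It is not the paper's sampler, only a toy illustration of the generalized Pólya urn idea: when a word is assigned to a topic, words that are semantically close to it also receive a fractional count under that topic. The function names `related_words` and `gpu_increment`, the cosine-similarity threshold, and the promotion weight `mu` are all assumptions made for this example and do not come from the abstract.

```python
import numpy as np

def related_words(embeddings, threshold=0.5):
    """Map each word id to the ids of other words whose cosine similarity
    to it exceeds `threshold` (a hypothetical cutoff for this sketch).
    Assumes no embedding row is the zero vector."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, -1.0)  # a word is not counted as related to itself
    return {w: np.flatnonzero(sim[w] > threshold).tolist()
            for w in range(len(embeddings))}

def gpu_increment(topic_word_counts, topic, word, related, mu=0.3):
    """Generalized Polya urn style update: add a full count for the sampled
    word and a fractional count `mu` (assumed value) for each related word."""
    topic_word_counts[topic, word] += 1.0
    for r in related[word]:
        topic_word_counts[topic, r] += mu
    return topic_word_counts

# Toy vocabulary of three words with 2-dimensional embeddings:
# words 0 and 1 point in nearly the same direction, word 2 is orthogonal to word 0.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
related = related_words(emb, threshold=0.8)

counts = np.zeros((2, 3))  # 2 topics x 3 words
gpu_increment(counts, topic=0, word=0, related=related)
print(related)
print(counts)
```

In this toy run, assigning word 0 to topic 0 also raises the count of word 1 under topic 0 by `mu`, while word 2 and topic 1 are left unchanged. A full sampler would also have to undo the same fractional counts whenever a word is removed from a topic.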
