Abstract: Many applications require semantic understanding of short texts, and inferring discriminative and coherent latent topics is a critical and fundamental task in these applications. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is so short, word co-occurrences in short texts are much sparser. Recent studies show that the Dirichlet Multinomial Mixture (DMM) model is effective for topic inference over short texts by assuming that each piece of short text is generated by a single topic. However, DMM has two main limitations. First, even though it seems reasonable to assume that each short text has only one topic because of its shortness, the definition of “shortness” is subjective and the length of the short texts is dataset dependent. That is, the single-topic assumption may be too strong for some datasets. To address this limitation, we propose to model the number of topics with a Poisson distribution, allowing each short text to be associated with a small number of topics (e.g., one to three topics). This model is named PDMM. Second, DMM (and also PDMM) does not have access to background knowledge (e.g., semantic relations between words) when modeling short texts. When a human being interprets a piece of short text, the understanding is based not solely on its content words, but also on their semantic relations. Recent advances in word embeddings offer effective learning of word semantic relations from a large corpus. Such auxiliary word embeddings enable us to address the second limitation. To this end, we propose to promote semantically related words under the same topic during the sampling process by using the generalized Pólya urn (GPU) model. Through the GPU model, background knowledge about word semantic relations learned from millions of external documents can be easily exploited to improve topic modeling for short texts.
By directly extending the DMM and PDMM models with the GPU model, we propose two more effective topic models for short texts, named GPU-DMM and GPU-PDMM. Through extensive experiments on two real-world short text collections in two languages, we demonstrate that PDMM achieves better topic representations than state-of-the-art models, as measured by topic coherence. As an indirect evaluation, the learned topic representation also leads to better accuracy in a text classification task. Both GPU-DMM and GPU-PDMM further improve topic coherence and text classification accuracy. GPU-PDMM outperforms GPU-DMM at the price of higher computational cost.
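
The promotion idea behind the generalized Pólya urn can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, the cosine-similarity threshold of 0.8, and the promotion weight of 0.3 are all assumptions chosen for illustration. The sketch shows only the count update: when a word is assigned to a topic, its own count grows by one and each semantically related word receives a smaller fractional increment under the same topic.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_words(embeddings, threshold=0.8):
    """Map each word to the other words whose embedding is close to it.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    related = defaultdict(list)
    for word, vec in embeddings.items():
        for other, other_vec in embeddings.items():
            if other != word and cosine(vec, other_vec) >= threshold:
                related[word].append(other)
    return related


def gpu_assign(topic_word_counts, topic, word, related, promotion=0.3):
    """Generalized Polya urn style update for one word-to-topic assignment.

    A plain Polya urn would only add 1 to the count of `word`. Here each
    related word also gains `promotion` (an assumed weight) under the topic.
    """
    counts = topic_word_counts[topic]
    counts[word] += 1.0
    for other in related.get(word, ()):
        counts[other] += promotion


# Toy embeddings (assumed values): "soccer" and "football" point the same way.
embeddings = {
    "soccer": [0.9, 0.1],
    "football": [0.8, 0.2],
    "stock": [0.1, 0.9],
}
related = related_words(embeddings)
topic_word_counts = defaultdict(lambda: defaultdict(float))
gpu_assign(topic_word_counts, topic=0, word="soccer", related=related)
```

In this toy run, assigning "soccer" to topic 0 also raises the count of "football" under that topic, while "stock" is left untouched. A full sampler would additionally have to reverse these increments whenever a word's topic is resampled, which the sketch omits.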

GLTM: A Global and Local Word Embedding-Based Topic Model for Short Texts

Short Text Understanding by Leveraging Knowledge into Topic Model

Exploiting Global Semantic Similarity Biterms for Short-Text Topic Discovery

Topic Discovery for Short Texts Using Word Embeddings

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

Short Text Topic Modeling by Exploring Original Documents

Topic Modeling over Short Texts by Incorporating Word Embeddings

Topic Model over Short Texts Incorporating Word Embedding

Short Text Topic Model with Word Embeddings and Context Information

Short Text Topic Modeling With Flexible Word Patterns

Semantic Augmented Topic Model over Short Text

A CWTM Model of Topic Extraction for Short Text

Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning

Enhancing Topic Modeling for Short Texts with Auxiliary Word Embeddings

A Semantic Modeling Method for Social Network Short Text Based on Spatial and Temporal Characteristics

Relational Biterm Topic Model: Short-Text Topic Modeling Using Word Embeddings

Short text topic modelling using local and global word-context semantic correlation

Modeling over Short Texts

Topic Modeling for Short Texts Via Dual View Collaborate Optimization

Topic Modeling for Short Texts Via Word Embedding and Document Correlation

BTM: Topic Modeling over Short Texts