Abstract: With the emergence of massive short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information from them is increasingly important for many real-world content analysis applications. Topic models can effectively explore the hidden structure of documents by assuming latent topics. However, because short texts are sparse, existing topic models, e.g., latent Dirichlet allocation, lose effectiveness on them. An effective remedy, the Dirichlet multinomial mixture (DMM), assumes that each short text is associated with only a single topic, which indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words and often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet Multinomial Mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve the local neighborhood structure of short texts, so that topical signals spread among neighboring documents and correct the inaccurate topic representations. This is achieved by incorporating variational manifold regularization into the variational objective of DMM, constraining nearby short texts to have similar variational topic representations. To find the nearest neighbors of short texts, we construct an offline document graph before model inference, where distances between short texts are computed by the word mover's distance. We further develop an online version of LapDMM, namely Online LapDMM, to speed up inference on massive short texts. To this end, we exploit stochastic optimization with mini-batches and an up-to-date document graph that efficiently finds approximate nearest neighbors instead.
To evaluate our models, we compare against the state-of-the-art short text topic models on several traditional tasks, i.e., topic quality, document clustering and classification. The empirical results demonstrate that our models achieve very significant performance gains over the baseline models.
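
The neighborhood-preserving idea described in the abstract can be illustrated with a minimal sketch. It assumes a pairwise distance matrix has already been computed (the abstract uses the word mover's distance; any symmetric distance works here), builds a k-nearest-neighbor document graph, and evaluates a standard graph-Laplacian smoothness penalty on per-document topic representations. The function names `knn_graph` and `manifold_penalty` are hypothetical, and the penalty shown is the generic manifold regularizer, which may differ in detail from the term used in LapDMM.

```python
import numpy as np

def knn_graph(dist, k):
    """Symmetric 0/1 adjacency linking each document to its k nearest neighbors.

    `dist` is an (n, n) matrix of pairwise document distances (assumed precomputed).
    """
    n = dist.shape[0]
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)              # a document is not its own neighbor
    nearest = np.argsort(d, axis=1)[:, :k]   # indices of the k closest documents
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(w, w.T)                # keep an edge if either side chose it

def manifold_penalty(w, phi):
    """0.5 * sum_ij w_ij * ||phi_i - phi_j||^2, computed as trace(phi^T L phi).

    `phi` is an (n, topics) matrix of per-document topic representations.
    """
    lap = np.diag(w.sum(axis=1)) - w         # unnormalized graph Laplacian
    return float(np.trace(phi.T @ lap @ phi))

# Toy data: documents 0 and 1 are close, as are documents 2 and 3.
dist = np.array([[0, 1, 9, 9],
                 [1, 0, 9, 9],
                 [9, 9, 0, 1],
                 [9, 9, 1, 0]])
w = knn_graph(dist, k=1)

smooth = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
noisy = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)  # doc 1 disagrees with its neighbor
smooth_cost = manifold_penalty(w, smooth)
noisy_cost = manifold_penalty(w, noisy)
```

In this toy case the penalty is zero when neighboring documents share a topic representation and positive when one of them deviates, which is the pressure that pulls a noisy document toward its neighbors.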

Topic Discovery from Heterogeneous Texts

Short Text Understanding by Leveraging Knowledge into Topic Model

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

Hierarchical Topic Modeling with Nested Hierarchical Dirichlet Process

Heterogeneous Latent Topic Discovery for Semantic Text Mining

News Topic Discovery Through Community Detection

Topic Modeling over Short Texts by Incorporating Word Embeddings

A Topic Model for Hierarchical Documents

Parsimonious Topic Models with Salient Word Discovery

A Joint Model of Extended LDA and IBTM over Streaming Chinese Short Texts

Topics Modeling Based on Selective Zipf Distribution

Deep de Finetti: Recovering Topic Distributions from Large Language Models

A Self-adaptive Sliding Window Based Topic Model for Non-uniform Texts

Topic Analysis for Text with Side Data

Topic extraction from extremely short texts with variational manifold regularization

Heterogeneous Supervised Topic Models

BTM: Topic Modeling over Short Texts

Topic Discovery and Evolution in Scientific Literature Based on Content and Citations

Topic Discovery for Streaming Short Texts with CTM

Text Network Exploration Via Heterogeneous Web of Topics

Topic Mining over Asynchronous Text Sequences