Abstract: With the emergence of massive short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information from them is increasingly significant for many real-world applications of content analysis. Topic models can effectively explore the hidden structures of documents by assuming latent topics. However, due to the sparseness of short texts, existing topic models, e.g., latent Dirichlet allocation, lose effectiveness on them. An effective solution, namely the Dirichlet multinomial mixture (DMM), supposes that each short text is associated with only a single topic, which indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words and often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet Multinomial Mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve the local neighborhood structures of short texts, enabling topical signals to spread among neighboring documents so as to correct inaccurate topic representations. This is achieved by incorporating variational manifold regularization into the variational objective of DMM, constraining close short texts to have similar variational topic representations. To find the nearest neighbors of short texts, we construct an offline document graph before model inference, where the distances between short texts are computed by the word mover's distance. We further develop an online version of LapDMM, namely Online LapDMM, to speed up inference on massive short texts. To this end, we exploit the spirit of stochastic optimization with mini-batches and an up-to-date document graph that can efficiently find approximate nearest neighbors instead.
To evaluate our models, we compare against state-of-the-art short text topic models on several traditional tasks, i.e., topic quality, document clustering, and classification. The empirical results demonstrate that our models achieve significant performance gains over the baseline models.
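
The manifold regularization idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `knn_graph` and `laplacian_penalty` are hypothetical, the pairwise distance matrix is assumed to be precomputed (for example with the word mover's distance), and a squared Euclidean distance between document-topic vectors is assumed in place of whatever divergence the actual variational regularizer uses. The sketch builds a symmetric k-nearest-neighbor document graph and evaluates the graph Laplacian penalty, which is small when neighboring documents have similar topic representations.

```python
import numpy as np


def knn_graph(dist, k):
    """Symmetric 0/1 k-nearest-neighbor adjacency from a pairwise distance matrix.

    `dist` is assumed to be precomputed (e.g. word mover's distances).
    """
    n = dist.shape[0]
    d = np.array(dist, dtype=float)      # copy so the caller's matrix is untouched
    np.fill_diagonal(d, np.inf)          # a document is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]  # indices of the k closest documents per row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)            # keep an edge if either side selected it


def laplacian_penalty(Q, W):
    """Compute 0.5 * sum_ij W_ij * ||Q_i - Q_j||^2 as trace(Q^T L Q), L = D - W.

    Q holds one topic distribution per row; squared Euclidean distance is an
    assumption made for this illustration.
    """
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(Q.T @ L @ Q))


# Toy example: documents 0 and 1 are close, documents 2 and 3 are close.
dist = np.array([[0, 1, 9, 9],
                 [1, 0, 9, 9],
                 [9, 9, 0, 1],
                 [9, 9, 1, 0]], dtype=float)
W = knn_graph(dist, k=1)

# Neighbors agree on their dominant topic.
Q_smooth = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
# Document 1 disagrees with its neighbor, document 0 (e.g. due to noisy words).
Q_noisy = np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]])

penalty_smooth = laplacian_penalty(Q_smooth, W)
penalty_noisy = laplacian_penalty(Q_noisy, W)
```

In this toy setting the penalty is zero when neighbors agree and positive when a document's topic representation deviates from its neighbor's, which is the pressure that would pull an inaccurate representation back toward its neighborhood when such a term is added to an inference objective.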
