Abstract: With the emergence of massive volumes of short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information from them is increasingly important for many real-world content analysis applications. Topic models can effectively explore the hidden structure of documents by assuming latent topics. However, because short texts are sparse, existing topic models, e.g., latent Dirichlet allocation, lose effectiveness on them. An effective solution is the Dirichlet multinomial mixture (DMM), which assumes that each short text is associated with a single topic and thereby indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words and often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet Multinomial Mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve the local neighborhood structure of short texts, so that topical signals spread among neighboring documents and correct inaccurate topic representations. This is achieved by incorporating variational manifold regularization into the variational objective of DMM, constraining nearby short texts to have similar variational topic representations. To find the nearest neighbors of short texts, we construct an offline document graph before model inference, in which distances between short texts are computed by the word mover's distance. We further develop an online version of LapDMM, namely Online LapDMM, to speed up inference on massive short texts. To this end, we adopt stochastic optimization with mini-batches and an up-to-date document graph that efficiently finds approximate, rather than exact, nearest neighbors.
To evaluate our models, we compare them against state-of-the-art short text topic models on several traditional tasks, i.e., topic quality, document clustering, and classification. The empirical results demonstrate that our models achieve significant performance gains over the baseline models.
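
The neighborhood-preserving idea can be illustrated with a minimal sketch. It is not the LapDMM objective itself: it assumes a generic graph Laplacian penalty of the form 0.5 * sum_ij W_ij ||theta_i - theta_j||^2, and it uses squared Euclidean distance on bag-of-words counts as a stand-in for the word mover's distance. The function names `knn_graph` and `laplacian_penalty`, and the toy data, are hypothetical.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric 0/1 k-nearest-neighbor adjacency matrix over the rows of X.

    Assumption: squared Euclidean distance replaces the word mover's distance.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    np.fill_diagonal(d, np.inf)                      # a document is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]              # indices of the k closest documents
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)                        # symmetrize the graph

def laplacian_penalty(W, theta):
    """0.5 * sum_ij W_ij ||theta_i - theta_j||^2, computed as trace(theta^T L theta)."""
    L = np.diag(W.sum(axis=1)) - W                   # unnormalized graph Laplacian
    return float(np.trace(theta.T @ L @ theta))

# Toy corpus: documents 0 and 1 share vocabulary, as do documents 2 and 3.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 2]], dtype=float)
W = knn_graph(X, k=1)

# Document-topic representations: consistent with the graph vs. one noisy document.
theta_smooth = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
theta_noisy = np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]])

print(laplacian_penalty(W, theta_smooth), laplacian_penalty(W, theta_noisy))
```

The penalty is zero when neighboring documents agree on their topic representation and grows when a document disagrees with its neighbors, which is the pressure a manifold regularizer adds to the inference objective.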
