Abstract: In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction.

Probabilistic clustering models, which build “soft” theme-document associations, have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between that document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional “semantic” space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space.

The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We empirically show that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, owing to the specific profile of the word count distributions. Using the fact that the model parameters can be analytically integrated out, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach.
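To make the model concrete, the following is a minimal sketch of EM for a mixture of multinomials over word counts, with a random soft initialization and additive smoothing. It is an illustration only and not the authors' implementation: the function name `em_multinomial_mixture`, the `smoothing` and `seed` parameters, the choice to smooth the mixture weights as well, and the toy count matrix are all assumptions introduced here.

```python
import math
import random


def em_multinomial_mixture(counts, n_themes, n_iter=50, smoothing=0.1, seed=0):
    """Fit a multinomial mixture by EM (illustrative sketch).

    counts: list of documents, each a list of word counts of equal length.
    Returns (weights, beta, resp): mixture weights, per-theme word
    probabilities, and per-document posterior theme probabilities.
    """
    rng = random.Random(seed)
    n_docs, vocab = len(counts), len(counts[0])

    # Random soft initialization of the posteriors (an assumption; the
    # abstract only says that initialization matters).
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in range(n_themes)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: smoothed mixture weights, so that no weight reaches zero.
        weights = [
            (smoothing + sum(resp[d][k] for d in range(n_docs)))
            / (n_docs + n_themes * smoothing)
            for k in range(n_themes)
        ]
        # M-step: smoothed word probabilities for each theme.
        beta = []
        for k in range(n_themes):
            expected = [
                smoothing + sum(resp[d][k] * counts[d][w] for d in range(n_docs))
                for w in range(vocab)
            ]
            total = sum(expected)
            beta.append([e / total for e in expected])

        # E-step: posterior theme probabilities, computed in the log domain.
        # The multinomial coefficient is the same for every theme and cancels.
        for d in range(n_docs):
            log_post = [
                math.log(weights[k])
                + sum(c * math.log(beta[k][w]) for w, c in enumerate(counts[d]) if c)
                for k in range(n_themes)
            ]
            top = max(log_post)
            unnorm = [math.exp(lp - top) for lp in log_post]
            norm = sum(unnorm)
            resp[d] = [u / norm for u in unnorm]

    return weights, beta, resp


# Hypothetical toy corpus: 4 documents over a 4-word vocabulary.
toy_counts = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 1, 4, 6]]
weights, beta, resp = em_multinomial_mixture(toy_counts, n_themes=2)
# Hard assignment: the most probable theme for each document.
labels = [row.index(max(row)) for row in resp]
```

Each row of `resp` is the probability vector described in the abstract, that is, the soft association of one document with each theme. Theme indices are arbitrary, so only the grouping of documents is meaningful, and a different `seed` may converge to a different local optimum.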

Short Text Clustering Based on Pitman-Yor Process Mixture Model

An Adaptive Dirichlet Multinomial Mixture Model for Short Text Streaming Clustering

Model-based clustering and classification using mixtures of multivariate skewed power exponential distributions

A Model-Based Approach for Text Clustering with Outlier Detection

X-DMM: Fast and Scalable Model Based Text Clustering

Inference and evaluation of the multinomial mixture model for text clustering

A New Text Clustering Method Using Hidden Markov Model

TSDPMM: Incorporating Prior Topic Knowledge into Dirichlet Process Mixture Models for Text Clustering.

Hierarchical mixtures of Unigram models for short text clustering: the role of Beta-Liouville priors

A Gamma-Poisson Mixture Topic Model for Short Text

Mixture Model-based Text Clustering: A Review

Grouped Text Clustering Using Non-Parametric Gaussian Mixture Experts

Reducing over-clustering via the powered Chinese restaurant process

Clustering Mixed-Type Data via Dirichlet Process Mixture Model with Cluster-Specific Covariance Matrices

Estimating the Number of Clusters Via Proportional Chinese Restaurant Process

CDP Mixture Models for Data Clustering

Bayesian mixture models (in)consistency for the number of clusters

Deep Dirichlet process mixture models.

Optimal Clustering of Discrete Mixtures: Binomial, Poisson, Block Models, and Multi-layer Networks

A Probabilistic Model For Clustering Text Documents With Multiple Fields

Joint unsupervised contrastive learning and robust GMM for text clustering