Abstract: Text clustering aims to organize a vast collection of documents into meaningful and coherent clusters, thereby facilitating the extraction of valuable insights. While current frameworks for text clustering try to minimize the anisotropy of pre-trained language models through contrastive learning of text embeddings, treating in-batch samples as negatives is suboptimal. The K-means algorithm offers a way to sample both hard negatives and false negatives. However, relying on a single measure of semantic similarity between distributions and using coarse-grained weighting for negative pairs may limit performance. Furthermore, because the rich semantics of text make the distributions of text clusters very similar, the Mahalanobis distance-based Gaussian Mixture Model (GMM) is prone to falling into local optima: a Gaussian component with a smaller weight may gradually merge into another during parameter estimation by the EM algorithm. To tackle these challenges, we propose a model named JourTC: Joint unsupervised contrastive learning and robust GMM for Text Clustering. In the contrastive learning phase, hard negatives, potential false negatives, and their corresponding global similarity-aware weights are determined through posterior probabilities derived from a Robust GMM (RGMM). This RGMM uses the entropy of each individual Gaussian model as a metric and adaptively adjusts the posterior probabilities of samples based on the Gaussian models with maximum and minimum entropy, diminishing the influence of low-entropy Gaussian models. Extensive experiments show that JourTC can be seamlessly integrated into existing text clustering frameworks, leading to a notable improvement in accuracy. Our code is publicly available.
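The two quantities the abstract builds on, GMM posterior probabilities and the entropy of each Gaussian component, can be illustrated with a minimal sketch. This is not the JourTC implementation: it assumes diagonal covariances for brevity, and the function names (`posteriors`, `component_entropy`, `negative_weights`) as well as the inner-product weighting are hypothetical choices made for illustration only. The entropy-based adjustment of posteriors that RGMM performs is not specified in the abstract and is therefore not reproduced here.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given x (EM E-step)."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the max before exponentiating, for stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def component_entropy(var):
    """Differential entropy of a diagonal Gaussian: 0.5 * sum(ln(2*pi*e*var_i))."""
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in var)


def negative_weights(anchor_post, candidate_posts):
    """Hypothetical similarity score: inner product of posterior vectors.

    A high score means the candidate likely shares the anchor's cluster
    (a potential false negative); a low score means it likely does not.
    """
    return [
        sum(a * c for a, c in zip(anchor_post, post))
        for post in candidate_posts
    ]


# Toy two-component mixture in 2-D (all values are illustrative assumptions).
weights = [0.7, 0.3]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.25, 0.25]]

anchor = posteriors([0.1, -0.2], weights, means, variances)
candidates = [
    posteriors([0.3, 0.1], weights, means, variances),  # near component 0
    posteriors([2.9, 3.1], weights, means, variances),  # near component 1
]
scores = negative_weights(anchor, candidates)
entropies = [component_entropy(v) for v in variances]
```

In this toy setup the narrower second component has the lower entropy, which is the kind of component whose influence the abstract says RGMM diminishes.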
