Abstract: Text clustering is a widely studied problem in text mining. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional sparse text data, achieving reasonable clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, which makes it inefficient to scale to long texts and large corpora, both common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to decrease the time complexity of sampling-based training for DMM from O(K∗L) to O(K∗U), where K is the number of clusters, L is the average document length, and U is the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm, which further reduces the sampling time complexity from O(K∗U) to O(U) in the training stages close to convergence. Moreover, we parallelize DMM model training with an uncollapsed Gibbs sampler to obtain further acceleration. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale to long documents and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that X-DMM achieves a substantial speedup over existing state-of-the-art algorithms without degrading clustering accuracy.
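
The gap between O(K∗L) and O(K∗U) can be illustrated with the standard collapsed Gibbs conditional for DMM under a symmetric prior. The sketch below is only an illustration under stated assumptions, not the X-DMM implementation: the function names, the count layout (`m_k`, `n_kw`, `n_k`), and the use of the log-gamma identity are all assumptions of this example, and the abstract's index-based scheme may differ. It scores one document against one cluster in two ways, once per token and once per unique word, and both give the same unnormalized log probability. The counts are assumed to exclude the document being scored.

```python
import math
from collections import Counter


def log_score_tokens(doc, m_k, n_kw, n_k, alpha, beta, V):
    """Per-token score for one cluster: cost grows with document length L.

    doc   : list of word tokens (hypothetical input format)
    m_k   : number of documents currently in the cluster
    n_kw  : dict word -> count of that word in the cluster
    n_k   : total word count in the cluster
    alpha, beta : symmetric Dirichlet hyperparameters (beta > 0)
    V     : vocabulary size
    """
    score = math.log(m_k + alpha)
    seen = Counter()  # occurrences of each word already processed in doc
    for i, w in enumerate(doc):
        score += math.log(n_kw.get(w, 0) + beta + seen[w])
        score -= math.log(n_k + V * beta + i)
        seen[w] += 1
    return score


def log_score_unique(doc_counts, doc_len, m_k, n_kw, n_k, alpha, beta, V):
    """Per-unique-word score: cost grows with the number of unique words U.

    Uses prod_{j=0}^{c-1} (x + j) = Gamma(x + c) / Gamma(x), so each
    unique word and the whole denominator cost a constant number of
    lgamma calls.
    """
    score = math.log(m_k + alpha)
    for w, c in doc_counts.items():
        base = n_kw.get(w, 0) + beta
        score += math.lgamma(base + c) - math.lgamma(base)
    total = n_k + V * beta
    score -= math.lgamma(total + doc_len) - math.lgamma(total)
    return score
```

Repeating either function over all K clusters and normalizing gives the sampling distribution for one document's cluster assignment; the second form touches each distinct word once, which is where the dependence on U rather than L comes from in this sketch.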

Clustering in Zero-Resource

Cluster-based Contrastive Disentangling for Generalized Zero-Shot Learning

ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models

An embedded segmental K-means model for unsupervised segmentation and clustering of speech

A new DP-like speaker clustering algorithm

Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Self Supervised Representation Learning with Deep Clustering for Acoustic Unit Discovery from Raw Speech

A Practical Algorithm for Solving the Sparseness Problem of Short Text Clustering

Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech

X-DMM: Fast and Scalable Model Based Text Clustering

Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications

Zero Resource Cross-Lingual Part Of Speech Tagging

Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for Low-Resource Languages

Multilingual acoustic word embedding models for processing zero-resource languages

Bringing in the outliers: A sparse subspace clustering approach to learn a dictionary of mouse ultrasonic vocalizations

Discriminative Neural Clustering for Speaker Diarisation

Beyond Prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations

Low-Latency Deep Clustering For Speech Separation

An Investigation of Context Clustering for Statistical Speech Synthesis with Deep Neural Network

ChatZero: Zero-shot Cross-Lingual Dialogue Generation via Pseudo-Target Language

State-Level Data Borrowing for Low-Resource Speech Recognition Based on Subspace GMMs