Abstract: We present *generative clustering* (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm by using importance sampling. We show that GC achieves state-of-the-art performance, outperforming any previous clustering method, often by a large margin. Furthermore, we show an application to generative document retrieval in which documents are indexed via hierarchical clustering and our method improves the retrieval accuracy.

What problem does this paper attempt to address?

The problem this paper addresses is how to use generative large language models (LLMs) to improve document clustering. Specifically, the authors propose a new method, generative clustering (GC), which clusters the texts generated by LLMs instead of the original documents. This approach captures the similarity between documents more accurately and allows the distance between documents to be defined rigorously within an information-theoretic framework.

### Core problems of the paper

1. **Limitations of traditional document clustering**:
   - Traditional methods usually convert documents into a computational representation (such as a bag-of-words model or BERT embeddings) and then apply a clustering algorithm (such as k-means). However, these representations may not fully capture the implicit knowledge and complex semantic information in documents.
2. **Introduction of generative large language models**:
   - A generative large language model (such as GPT-4) can generate, for each document, a series of texts that supply the missing knowledge. These texts interpret the content of the documents better, and because the LLM provides a probability distribution over them, the similarity between documents can be defined rigorously in information-theoretic terms.
3. **Clustering within the information-theoretic framework**:
   - The authors propose a distance measure based on the KL (Kullback-Leibler) divergence to quantify the differences between documents. Clustering is carried out over the infinite space of text sequences, and the resulting computational challenge is handled with importance sampling.

### Main contributions

- **Proposing a new generative clustering method**: Clustering the texts generated by LLMs significantly improves clustering quality.
- **Rigorous definition within the information-theoretic framework**: Using the KL divergence as the distance measure gives the clustering process a principled foundation.
- **Experimental evidence of superior performance**: On multiple document clustering datasets, the method significantly outperforms existing clustering methods in terms of accuracy, normalized mutual information (NMI), and adjusted Rand index (ARI).

### Formula summary

- **Document generation probability distribution**:
  \[ p(Y = y|x) = p(\omega_1|x)\,p(\omega_2|x,\omega_1)\cdots p(\omega_l|x,\omega_1,\ldots,\omega_{l-1}) \]
  where $\omega_1,\omega_2,\ldots,\omega_l$ are the words in text $y$, and $l$ is the text length.
- **Clustering objective function**:
  \[ \min_{f,\{p(Y|k)\}_k} D=\sum_{x\in X}d(x,f(x)) \]
  where $d(x,k) = \mathrm{KL}[p(Y|x)\,\|\,p(Y|k)]$ is the KL divergence between document $x$ and the cluster center $k$, and $f$ maps each document to its cluster.
- **Importance weight matrix**:
  \[ W_{ij}=\left(\frac{p(y_j|x_i)}{\phi(y_j)}\right)^\alpha \]
  where $\alpha$ is a regularization parameter, and $\phi(y_j)$ is the proposal distribution from which the texts $y_j$ are sampled.
- **Estimating the cluster center**:
  \[ c_k(j)=\frac{\tilde{c}_k(j)}{\sum_{j'}\tilde{c}_k(j')},\qquad \tilde{c}_k(j)=\sum_{x_i\in f^{-1}(k)}W_{ij} \]
  that is, the importance weights of the documents assigned to cluster $k$ are summed and then normalized over the sampled texts so that $\sum_j c_k(j)=1$.

With these formulas and methods, the authors demonstrate the superior performance of generative clustering on multiple datasets.
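The importance-weight and cluster-center formulas above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the log-probabilities $\log p(y_j|x_i)$ and $\log\phi(y_j)$ are already available as arrays (the names `log_p`, `log_phi`, and `init_labels` are hypothetical), it turns each row of $W$ into a discrete distribution over the sampled texts by row normalization, and it alternates assignment and center updates in a k-means-like loop. Those last two choices are assumptions made for illustration.

```python
import numpy as np

def importance_weights(log_p, log_phi, alpha=0.25):
    """W_ij = (p(y_j|x_i) / phi(y_j))**alpha, computed in log space."""
    log_w = alpha * (log_p - log_phi[None, :])
    # A common scale factor cancels in every normalization below, so
    # subtracting the global maximum only improves numerical stability.
    return np.exp(log_w - log_w.max())

def cluster_centers(W, labels, n_clusters, eps=1e-12):
    """c_k(j): sum of the member rows of W, normalized to sum to 1 over j."""
    centers = np.zeros((n_clusters, W.shape[1]))
    for k in range(n_clusters):
        centers[k] = W[labels == k].sum(axis=0)
    centers += eps  # an empty cluster falls back to a uniform center
    return centers / centers.sum(axis=1, keepdims=True)

def kl_to_centers(W, centers, eps=1e-12):
    """d[i, k] = KL(P_i || c_k), with P_i the row-normalized weights (assumption)."""
    P = W / W.sum(axis=1, keepdims=True)
    log_P = np.log(P + eps)
    log_C = np.log(centers + eps)
    return (P * log_P).sum(axis=1, keepdims=True) - P @ log_C.T

def generative_clustering(log_p, log_phi, n_clusters, alpha=0.25,
                          n_iter=50, init_labels=None, seed=0):
    """Alternate center estimation and minimum-KL assignment (toy-scale sketch)."""
    W = importance_weights(log_p, log_phi, alpha)
    if init_labels is None:
        rng = np.random.default_rng(seed)
        labels = rng.integers(n_clusters, size=W.shape[0])
    else:
        labels = np.asarray(init_labels)
    for _ in range(n_iter):
        centers = cluster_centers(W, labels, n_clusters)
        new_labels = kl_to_centers(W, centers).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Recompute so the returned centers always match the returned labels.
    return labels, cluster_centers(W, labels, n_clusters)
```

The weights are computed in log space because LLM sequence probabilities are far too small to exponentiate directly. With real log-probabilities, rows of `W` can still underflow, so this sketch is only intended for small, well-scaled inputs.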

Information-Theoretic Generative Clustering of Documents

Document Clustering Using Locality Preserving Indexing

Document Clustering Based on Probabilistic Topic Model

Learning Deep Generative Clustering Via Mutual Information Maximization

Parallel Topic Model and Its Application on Document Clustering

Generalized Clustering and Multi-Manifold Learning with Geometric Structure Preservation

Sparse Poisson coding for high dimensional document clustering

Regularized clustering for documents

Vec2GC -- A Graph Based Clustering Method for Text Representations

Application of Genetic Algorithm in Document Clustering

Semi-crowdsourced Clustering with Deep Generative Models

Representing Document As Dependency Graph for Document Clustering

Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding

A Survey of Generative Information Retrieval

Cross-Lingual Document Clustering Based on Similarity Space Model

Design and simulation of a document clustering algorithm based on genetic algorithm

CLGVSM: Adapting Generalized Vector Space Model to Cross-lingual Document Clustering

ClusterLLM: Large Language Models as a Guide for Text Clustering

Hierarchical Clustering Algorithms for Document Datasets

Generative Text Convolutional Neural Network for Hierarchical Document Representation Learning

Joint unsupervised contrastive learning and robust GMM for text clustering