Abstract: We present {\em generative clustering} (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm by using importance sampling. We show that GC achieves the state-of-the-art performance, outperforming any previous clustering method often by a large margin. Furthermore, we show an application to generative document retrieval in which documents are indexed via hierarchical clustering and our method improves the retrieval accuracy.
Machine Learning, Computation and Language, Information Retrieval, Information Theory
What problem does this paper attempt to address?
This paper asks how generative large language models (LLMs) can improve document clustering. The authors propose generative clustering (GC), which clusters texts generated by an LLM from each document rather than the original documents themselves. Because the LLM assigns a probability to every generated text, the distance between documents can be defined rigorously within an information-theoretic framework, and the method is intended to capture document similarity more accurately.
### Core problems of the paper
1. **Limitations of traditional document clustering**:
- Traditional methods convert documents into a computational representation (such as a bag-of-words vector or BERT embeddings) and then apply a clustering algorithm (such as k-means). These representations may not fully capture the implicit knowledge and complex semantics in documents.
2. **Introduction of generative large language models**:
- A generative LLM (such as GPT-4) can produce, for each document, a set of texts that spell out knowledge the document leaves implicit. These texts help interpret the document's content, and the probabilities the LLM assigns to them allow document similarity to be defined rigorously using information theory.
3. **Clustering within the information-theoretic framework**:
- The authors use the KL (Kullback-Leibler) divergence as the distance between documents. Because the distributions are defined over an infinite space of text sequences, the divergence cannot be computed exactly; importance sampling over a finite set of generated texts makes it tractable.
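The role of importance sampling can be illustrated on a toy problem. The sketch below is not the authors' code: it uses a hypothetical five-outcome space (so the exact KL divergence is available for comparison), and the names `kl_exact`, `kl_importance_sampling`, `p`, `q`, and `phi` are assumptions introduced for illustration. It draws samples from a proposal distribution `phi` and reweights each sample by `p / phi` to estimate an expectation under `p`.

```python
import numpy as np

def kl_exact(p, q):
    """Exact KL[p || q] for two discrete distributions on the same support."""
    return float(np.sum(p * np.log(p / q)))

def kl_importance_sampling(p, q, phi, n_samples, rng):
    """Estimate KL[p || q] = E_p[log(p/q)] using samples drawn from phi.

    Each sample y_j ~ phi is reweighted by p(y_j) / phi(y_j), so the
    average converges to the expectation under p.
    """
    ys = rng.choice(len(phi), size=n_samples, p=phi)  # indices of sampled outcomes
    weights = p[ys] / phi[ys]                         # importance weights
    return float(np.mean(weights * np.log(p[ys] / q[ys])))

# Toy distributions (illustrative values only).
p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
q = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
phi = np.full(5, 0.2)  # uniform proposal

rng = np.random.default_rng(0)
estimate = kl_importance_sampling(p, q, phi, n_samples=200_000, rng=rng)
exact = kl_exact(p, q)
```

In the paper's setting the outcomes are whole generated texts, so the exact sum is unavailable and only a sampled estimate of this kind is feasible.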
### Main contributions
- **A new generative clustering method**: Documents are clustered using texts generated by LLMs rather than the original documents, which the authors report improves clustering quality substantially.
- **A rigorous information-theoretic definition**: Using the KL divergence as the distance between documents gives the clustering objective a principled basis.
- **Experimental evidence of superior performance**: On multiple document clustering datasets, the method outperforms existing clustering methods in accuracy, normalized mutual information (NMI), and adjusted Rand index (ARI).
### Formula summary
- **Document generation probability distribution**:
\[
p(Y = y|x) = p(\omega_1|x)p(\omega_2|x,\omega_1)\cdots p(\omega_l|x,\omega_1,\ldots,\omega_{l - 1})
\]
where \(\omega_1,\omega_2,\ldots,\omega_l\) are the words in text \(y\), and \(l\) is the text length.
- **Clustering objective function**:
\[
\min_{f,\{p(Y|k)\}_k}D=\sum_{x\in X}d(x,f(x))
\]
where \(f\) assigns each document to a cluster and \(d(x,k) = \mathrm{KL}[p(Y|x)\,\|\,p(Y|k)]\) is the KL divergence between document \(x\) and the cluster center \(k\).
- **Importance weight matrix**:
\[
W_{ij}=\left(\frac{p(y_j|x_i)}{\phi(y_j)}\right)^\alpha
\]
where \(\alpha\) is a regularization parameter, and \(\phi(y_j)\) is the proposal distribution.
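- **Estimating the cluster center**:
\[
c_k(j)=\frac{\sum_{x_i\in f^{-1}(k)}W_{ij}}{\sum_{j'}\sum_{x_i\in f^{-1}(k)}W_{ij'}}
\]
where the denominator normalizes \(c_k\) so that it sums to 1 over the sampled texts \(y_j\).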
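The formulas above can be combined into a k-means-style loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the input `log_p` is a hypothetical matrix of sequence log-probabilities \(\log p(y_j|x_i)\) already obtained from an LLM, the proposal \(\phi(y_j)\) is assumed to be the average of \(p(y_j|x_i)\) over documents, each row of \(W\) is assumed to be normalized into a distribution, and the default `alpha=0.25` is an arbitrary illustrative value. All function names are introduced here for illustration.

```python
import numpy as np

def sequence_log_prob(token_log_probs):
    """Chain rule in log space: log p(y|x) is the sum of per-token log-probs."""
    return float(np.sum(token_log_probs))

def importance_weights(log_p, alpha=0.25):
    """W_ij = (p(y_j|x_i) / phi(y_j)) ** alpha, computed from log-probabilities.

    Assumption: phi(y_j) is the mean of p(y_j|x_i) over all documents.
    """
    n = log_p.shape[0]
    col_max = log_p.max(axis=0)  # subtract the column max for numerical stability
    log_phi = col_max + np.log(np.exp(log_p - col_max).sum(axis=0)) - np.log(n)
    return np.exp(alpha * (log_p - log_phi))

def normalize_rows(a):
    """Assumption: each document's weights are scaled to sum to 1."""
    return a / a.sum(axis=1, keepdims=True)

def cluster_centers(W, labels, k):
    """c_k(j): sum of W_ij over the members of cluster k, normalized over j."""
    centers = np.zeros((k, W.shape[1]))
    for c in range(k):
        members = W[labels == c]
        if len(members) == 0:
            centers[c] = 1.0 / W.shape[1]  # empty cluster: fall back to uniform
        else:
            total = members.sum(axis=0)
            centers[c] = total / total.sum()
    return centers

def kl_to_centers(P, centers, eps=1e-12):
    """d[i, k] = sum_j P_ij * log(P_ij / c_k(j)), a discrete KL divergence."""
    log_P = np.log(P + eps)
    log_C = np.log(centers + eps)
    return (P * log_P).sum(axis=1, keepdims=True) - P @ log_C.T

def cluster(log_p, k, init_labels, alpha=0.25, n_iter=20):
    """Alternate between re-estimating centers and reassigning documents."""
    W = normalize_rows(importance_weights(log_p, alpha))
    labels = np.asarray(init_labels)
    for _ in range(n_iter):
        centers = cluster_centers(W, labels, k)
        new_labels = kl_to_centers(W, centers).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

As with k-means, the result of such a loop depends on the initial assignment, so in practice it would typically be run from several initializations.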
Together, these formulas define the generative clustering procedure, which the authors report outperforms existing methods on multiple datasets.