Large Language Models Enable Few-Shot Clustering

Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, Graham Neubig
2023-07-02
Abstract: Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs for post-correction). We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.
Computation and Language
What problem does this paper attempt to address?
This paper asks how large language models (LLMs) can enable few-shot semi-supervised text clustering, reducing the amount of expert feedback required while improving clustering quality. Traditional unsupervised clustering organizes data without any explicit guidance, so its output often fails to match the specific needs of domain experts. Existing semi-supervised clustering methods do accept expert feedback, but they typically require a large amount of it, which is costly and inefficient in practice. The paper therefore proposes using an LLM to amplify a small amount of expert guidance, so that the clustering algorithm can produce high-quality clusters from minimal feedback.

The paper explores integrating LLMs into the clustering process at three stages:

1. **Pre-clustering**: Enhancing text representation by generating key phrases.
2. **During clustering**: Guiding the clustering algorithm by providing pairwise constraints.
3. **Post-clustering**: Improving clustering results by correcting low-confidence cluster assignments.

Experimental results show that using LLMs in the first two stages (pre-clustering and during clustering) can significantly improve clustering quality, and on certain tasks achieves results close to traditional semi-supervised clustering methods at a much lower cost. The effect of post-clustering correction, however, is limited.
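To make the "during clustering" stage more concrete, here is a minimal Python sketch of how pairwise constraints might be gathered and consolidated. It is an illustration under stated assumptions, not the authors' implementation: `collect_constraints`, `must_link_groups`, and `toy_oracle` are hypothetical names, and `toy_oracle` is a trivial stand-in for the LLM call that would judge whether two texts belong together.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[int, int]


def collect_constraints(
    texts: List[str],
    pairs: List[Pair],
    same_cluster: Callable[[str, str], bool],
) -> Tuple[List[Pair], List[Pair]]:
    """Ask an oracle about each candidate pair and sort the answers into
    must-link and cannot-link constraints."""
    must_link: List[Pair] = []
    cannot_link: List[Pair] = []
    for i, j in pairs:
        if same_cluster(texts[i], texts[j]):
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


def must_link_groups(n: int, must_link: List[Pair]) -> List[List[int]]:
    """Merge must-link pairs transitively with union-find, so that
    (a, b) and (b, c) place a, b, and c in one group."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in must_link:
        parent[find(i)] = find(j)

    groups: Dict[int, Set[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return sorted(sorted(g) for g in groups.values())


def toy_oracle(a: str, b: str) -> bool:
    # ASSUMPTION: placeholder for an LLM judgment. Here two texts "match"
    # when they share the same first word.
    return a.split()[0].lower() == b.split()[0].lower()


texts = [
    "refund my order",
    "refund request pending",
    "login fails on mobile",
    "login page is blank",
]
pairs = [(0, 1), (1, 2), (2, 3)]  # candidate pairs chosen for illustration

must, cannot = collect_constraints(texts, pairs, toy_oracle)
groups = must_link_groups(len(texts), must)
```

In a real pipeline the oracle would wrap a prompt to an LLM, and the resulting must-link and cannot-link pairs would be passed to a constrained clustering algorithm. The grouping step here only shows how must-link answers compose transitively.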