Text Clustering as Classification with LLMs

Chen Huang, Guoxiu He
2024-10-01
Abstract: Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address this issue, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM. First, we prompt the LLM to generate potential labels for a given dataset. Second, after integrating similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been shown experimentally to achieve comparable or superior performance to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. Our code is publicly available at https://anonymous.4open.science/r/Text-Clustering-via-LLM-E500.
Computation and Language, Information Retrieval
What problem does this paper attempt to address?
The paper addresses text clustering in practical settings where manual annotation is too expensive. Traditional methods usually require fine-tuning an embedder for the downstream data and rely on sophisticated similarity measures, which costs both time and compute. The paper proposes a framework that instead turns clustering into a classification task by leveraging the in-context learning capability of large language models (LLMs). The method has two stages:

1. **Label Generation**: The LLM is prompted to generate potential labels for the dataset.
2. **Classification Task**: After similar labels are integrated, the LLM is prompted again to assign the most appropriate of these labels to each sample.

This method requires neither a fine-tuned embedder nor a complex clustering algorithm, and the reported experiments show performance comparable to or better than state-of-the-art embedding-based clustering methods. The framework also improves the interpretability of the clustering results, because each cluster carries a meaningful label.
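The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the batching, the use of a second LLM call to merge similar labels, and every name here (`LLM`, `parse_labels`, `generate_labels`, `merge_labels`, `classify`, `cluster_texts`, `toy_llm`) are hypothetical. The rule-based `toy_llm` only stands in for a real model so that the sketch runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a function that takes a prompt and returns raw text.
LLM = Callable[[str], str]


def parse_labels(completion: str) -> List[str]:
    """Split a comma- or newline-separated completion into unique, lowercased labels."""
    seen, labels = set(), []
    for piece in completion.replace("\n", ",").split(","):
        label = piece.strip().lower()
        if label and label not in seen:
            seen.add(label)
            labels.append(label)
    return labels


def generate_labels(texts: List[str], llm: LLM, batch_size: int = 2) -> List[str]:
    """Stage 1: ask the LLM for candidate labels, one batch of texts at a time."""
    candidates: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = "\n".join(texts[start:start + batch_size])
        candidates.extend(parse_labels(llm(f"TASK: propose\nTEXTS:\n{batch}")))
    return parse_labels(",".join(candidates))  # de-duplicate across batches


def merge_labels(candidates: List[str], llm: LLM) -> List[str]:
    """Integrate similar labels. Doing this with an LLM prompt is an assumption."""
    return parse_labels(llm("TASK: merge\nLABELS: " + ", ".join(candidates)))


def classify(texts: List[str], labels: List[str], llm: LLM) -> Dict[str, List[str]]:
    """Stage 2: ask the LLM to pick one label per text; clusters are the label groups."""
    clusters: Dict[str, List[str]] = {label: [] for label in labels}
    options = ", ".join(labels)
    for text in texts:
        answer = llm(f"TASK: classify\nLABELS: {options}\nTEXT: {text}").strip().lower()
        # Answers outside the label set go to an explicit bucket instead of being dropped.
        key = answer if answer in clusters else "unassigned"
        clusters.setdefault(key, []).append(text)
    return clusters


def cluster_texts(texts: List[str], llm: LLM, batch_size: int = 2) -> Dict[str, List[str]]:
    """Run label generation, label merging, and classification in sequence."""
    labels = merge_labels(generate_labels(texts, llm, batch_size), llm)
    return classify(texts, labels, llm)


def toy_llm(prompt: str) -> str:
    """Rule-based stand-in for a real model (illustration only)."""
    if prompt.startswith("TASK: propose"):
        body, found = prompt.lower(), []
        if "refund" in body or "charged" in body:
            found.append("billing")
        if "charged" in body:
            found.append("payments")  # deliberately redundant with "billing"
        if "password" in body or "login" in body:
            found.append("account access")
        return ", ".join(found)
    if prompt.startswith("TASK: merge"):
        return prompt.split("LABELS: ", 1)[1].replace("payments", "billing")
    text = prompt.split("TEXT: ", 1)[1].lower()
    return "billing" if ("refund" in text or "charged" in text) else "account access"


if __name__ == "__main__":
    samples = [
        "I was charged twice this month",
        "Please refund my last order",
        "I forgot my password",
        "The login page keeps rejecting me",
    ]
    for label, members in cluster_texts(samples, toy_llm).items():
        print(label, members)
```

With the toy stub, the redundant candidate label "payments" is folded into "billing" before classification, so the four sample texts end up in two labeled clusters. In practice `toy_llm` would be replaced by a call to an actual model, and the parsing and fallback logic would need to tolerate free-form completions.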