ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models

Hwiyeol Jo, Hyunwoo Lee, Taiwoo Park
2024-06-19
Abstract: The recent advancements in large language models (LLMs) have brought significant progress in solving NLP tasks. Notably, in-context learning (ICL) is the key enabling mechanism for LLMs to understand specific tasks and grasp their nuances. In this paper, we propose a simple yet effective method to contextualize a task toward a specific LLM, by (1) observing how a given LLM describes (all or a part of) target datasets, i.e., open-ended zero-shot inference, (2) aggregating the open-ended inference results by the LLM, and (3) finally incorporating the aggregated meta-information for the actual task. We show the effectiveness of this approach in text clustering tasks, and also highlight the importance of the contextualization through examples of the above procedure.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to use large language models (LLMs) for text clustering in a zero-shot setting. It proposes a method called Zero-shot Distribution Learning (ZeroDL), which derives meta-information about the data distribution from the LLM's own zero-shot outputs and then uses that meta-information in the actual clustering task. Rather than relying on distance metrics in an embedding space, as traditional clustering algorithms do, ZeroDL injects a specific perspective through text-level prompts, so that the model clusters the data in a more targeted and contextually relevant way.

The main steps of ZeroDL are:

1. **Open-ended zero-shot inference**: Design a simple prompt that lets the LLM predict a class for each input without a predefined label set.
2. **Output aggregation**: Aggregate the open-ended predictions to generate meta-information about the dataset.
3. **Using meta-information for prediction**: Integrate the aggregated meta-information into the classification prompt to improve the model's predictions.

Experimental results show that ZeroDL performs comparably to, or better than, embedding-based clustering methods on multiple datasets, particularly on smaller datasets. In addition, ZeroDL automatically generates interpretable category labels, which offer insight into how the LLM describes the data.
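The three steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the function names, the `LLM` callable interface, and the keyword-based `toy_llm` stand-in are all assumptions made for illustration. Aggregation is reduced here to simple frequency counting, and answers outside the aggregated label set fall back to the most frequent label; the paper's own aggregation procedure may differ.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a function that takes a prompt and returns a completion.
LLM = Callable[[str], str]


def open_ended_inference(texts: List[str], llm: LLM) -> List[str]:
    """Step 1: ask the model for a free-form category for each text."""
    prompt = "Text: {text}\nWhat category does this text belong to? Answer with a short label."
    return [llm(prompt.format(text=t)).strip().lower() for t in texts]


def aggregate_labels(raw_labels: List[str], num_clusters: int) -> List[str]:
    """Step 2: keep the most frequent free-form labels as meta-information.

    Plain frequency counting is an assumption of this sketch.
    """
    return [label for label, _ in Counter(raw_labels).most_common(num_clusters)]


def classify_with_meta(texts: List[str], labels: List[str], llm: LLM) -> List[str]:
    """Step 3: re-prompt with the aggregated labels as the candidate classes."""
    options = ", ".join(labels)
    assignments = []
    for t in texts:
        prompt = f"Candidate categories: {options}\nText: {t}\nChoose exactly one category."
        answer = llm(prompt).strip().lower()
        # Assumption: answers outside the label set fall back to the most frequent label.
        assignments.append(answer if answer in labels else labels[0])
    return assignments


def run_zerodl_sketch(texts: List[str], num_clusters: int, llm: LLM) -> Tuple[List[str], List[str]]:
    """Run the three steps in order and return (labels, per-text assignments)."""
    raw = open_ended_inference(texts, llm)
    labels = aggregate_labels(raw, num_clusters)
    return labels, classify_with_meta(texts, labels, llm)


def toy_llm(prompt: str) -> str:
    """Keyword-based stand-in for a real LLM call, used only to make the sketch runnable."""
    text = next(line for line in prompt.splitlines() if line.startswith("Text: ")).lower()
    if "goal" in text or "match" in text:
        return "Sports"
    if "stock" in text or "market" in text:
        return "Business"
    return "Other"


DEMO_TEXTS = [
    "The striker scored a late goal.",
    "Stock prices fell sharply.",
    "The match went to extra time.",
    "The market rallied after the announcement.",
    "A goal in the final minute won the cup.",
    "It rained all afternoon.",
]

demo_labels, demo_assignments = run_zerodl_sketch(DEMO_TEXTS, num_clusters=2, llm=toy_llm)
```

In practice `toy_llm` would be replaced by a call to an actual model. The point of the sketch is the data flow: the cluster labels are not supplied by the user but come from the model's own open-ended answers, and only then are they fed back as candidate classes.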