TeC: A Novel Method for Text Clustering with Large Language Models Guidance and Weakly-Supervised Contrastive Learning

Chen Yang, Bin Cao, Jing Fan
DOI: https://doi.org/10.1609/icwsm.v18i1.31419
2024-05-28
Abstract: Text clustering is an important branch of unsupervised learning and is widely used in social media. Large Language Models (LLMs) represent a significant recent advance in AI, and several works have sought to improve the clustering performance of embedding models with feedback from LLMs. However, current approaches rarely take into account the cluster label information shared between text instances when fine-tuning embedding models, which leads to the problem of cluster collision. To tackle this issue, this paper proposes TeC, a novel method that operates in two phases, teaching and correcting. In both phases, LLMs act as teachers that guide embedding models, the students, to improve their clustering performance. The teaching phase conveys cluster label information to embedding models by querying LLMs in a batch-wise manner, and uses a proposed weakly-supervised contrastive learning loss to fine-tune the embedding models on that label information. The correcting phase then refines the clustering results of the teaching phase by instructing LLMs to correct the cluster assignments of low-confidence samples. Extensive experiments on six text datasets across three clustering tasks show that the proposed method outperforms existing state-of-the-art approaches.
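To make the idea of label-guided contrastive fine-tuning concrete, the following is a minimal sketch of a generic supervised contrastive loss computed over one batch. It is not the weakly-supervised loss proposed in the paper, whose exact form the abstract does not give. The function name `supervised_contrastive_loss`, the `temperature` value, and the toy embeddings are all assumptions made for illustration, and the `labels` argument merely stands in for cluster labels that an LLM might supply for a batch.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss for one batch (illustrative only).

    embeddings: list of equal-length, non-zero float vectors.
    labels: one cluster label per embedding (a stand-in for LLM-provided labels).
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a same-label sample.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical toy batch: two tight groups in a 2-D embedding space.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
mixed = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch, labels that agree with the geometry of the embeddings (`aligned`) give a lower loss than labels that cut across it (`mixed`). Minimizing such a loss therefore pulls same-label texts together and pushes different-label texts apart.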