Abstract: Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. This study argues that recent advancements in large language models (LLMs) have the potential to enhance this task. The research investigates how different textual embeddings, particularly those utilised in LLMs, and various clustering algorithms influence the clustering of text datasets. A series of experiments was conducted to evaluate the impact of embeddings on clustering results, the role of dimensionality reduction through summarisation, and the adjustment of model size. The findings indicate that LLM embeddings are superior at capturing subtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better results in three out of five clustering metrics across most tested datasets. Most LLM embeddings show improvements in cluster purity and provide a more informative silhouette score, reflecting a refined structural understanding of text data compared to traditional methods. Among the more lightweight models, BERT demonstrates leading performance. Additionally, it was observed that increasing model dimensionality and employing summarisation techniques do not consistently enhance clustering efficiency, suggesting that these strategies require careful consideration for practical application. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by integrating embeddings from LLMs, offering improved methodologies and suggesting new avenues for future research in various types of textual analysis.
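
The abstract names cluster purity as one of the evaluation metrics but does not define it. As a minimal sketch, assuming the standard definition (the share of items that carry the majority ground-truth label of their assigned cluster), purity could be computed as below. The function name `cluster_purity` and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def cluster_purity(cluster_ids, true_labels):
    """Fraction of items that carry the majority true label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Group the true labels by the cluster each item was assigned to.
    labels_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        labels_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Sum the size of the majority class in every cluster.
    majority_total = sum(max(counts.values()) for counts in labels_by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents, two predicted clusters, two topics.
predicted = [0, 0, 0, 1, 1, 1]
topics = ["sport", "sport", "tech", "tech", "tech", "tech"]
purity = cluster_purity(predicted, topics)
print(purity)
```

A purity of 1.0 means every cluster contains a single ground-truth class. Because purity rises trivially as the number of clusters grows, it is usually read alongside other metrics such as the silhouette score.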
