Abstract: Text clustering is an important method for organising the growing volume of digital content, helping to structure uncategorised data and uncover hidden patterns in it. Its effectiveness depends largely on the choice of textual embeddings and clustering algorithms. This study argues that recent advances in large language models (LLMs) can improve this task. It investigates how different textual embeddings, particularly those used in LLMs, and various clustering algorithms influence the clustering of text datasets. A series of experiments was conducted to evaluate the impact of embeddings on clustering results, the role of dimensionality reduction through summarisation, and the effect of adjusting model size. The findings indicate that LLM embeddings are superior at capturing subtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better results in three out of five clustering metrics across most tested datasets. Most LLM embeddings improve cluster purity and provide a more informative silhouette score, reflecting a more refined structural understanding of text data than traditional methods offer. Among the more lightweight models, BERT demonstrates leading performance. Increasing model dimensionality and employing summarisation techniques do not consistently enhance clustering efficiency, which suggests that these strategies require careful consideration before practical application. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by integrating embeddings from LLMs, offering improved methodologies and suggesting new avenues for future research in various types of textual analysis.
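
Cluster purity, one of the metrics named in the abstract, can be illustrated with a small example. The sketch below uses the standard textbook definition (each cluster is credited with its most frequent reference label, and the credits are summed and divided by the number of documents); it is not taken from the study itself, and the function name `cluster_purity` and the toy labels are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def cluster_purity(true_labels, cluster_ids):
    """Return the purity of a clustering against reference labels.

    Purity = (1 / N) * sum over clusters of the size of the largest
    reference class inside that cluster. Hypothetical helper, standard
    definition; not the study's own implementation.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    if not true_labels:
        raise ValueError("at least one document is required")

    # Count how often each reference label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        label_counts[cluster][label] += 1

    # Credit each cluster with its majority label count.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(true_labels)


# Toy data (invented for illustration): six documents, two clusters.
true_labels = ["sport", "sport", "sport", "politics", "politics", "tech"]
cluster_ids = [0, 0, 1, 1, 1, 1]

purity = cluster_purity(true_labels, cluster_ids)
print(f"purity = {purity:.3f}")
```

Purity rises trivially as the number of clusters grows (one document per cluster gives a purity of 1.0), which is one reason to read it alongside other metrics such as the silhouette score.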

A Lda-Based Algorithm For Length-Aware Text Clustering

CLDA: Feature Selection for Text Categorization Based on Constrained LDA

Text Clustering as Classification with LLMs

X-DMM: Fast and Scalable Model Based Text Clustering

Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm

Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification

A Text Clustering Algorithm to Detect Basic Level Categories in Texts

Clust-LDA: Joint Model for Text Mining and Author Group Inference

Feature Dimension Reduction Short Text Clustering Combined with Semantic and Statistics

Enhancing Web Text Clustering Accuracy and Efficiency With a Maximum Entropy Function Model: Overcoming High-Dimensional and Directional Challenges

An Unsupervised Learning Short Text Clustering Method

DIAS: A Disassemble-Assemble Framework for Highly Sparse Text Clustering

A Linguistic Feature Based Text Clustering Method

Wt-Lda: User Tagging Augmented Lda For Web Service Clustering

TCUAP: A Novel Approach of Text Clustering Using Asymmetric Proximity

Text Clustering with Large Language Model Embeddings

CDW: A Text Clustering Model for Diverse Versions Discovery

Text Stream Clustering Algorithm Based on Adaptive Feature Selection

Labeling Clusters from Both Linguistic and Statistical Perspectives: A Hybrid Approach

Combining Text Clustering and Retrieval for Corpus Adaptation