Abstract: Sentence clustering plays a central role in many text-processing tasks, and measuring the semantic similarity between sentences has received extensive attention. However, relatively little work has evaluated clustering performance with similarity measures built on low-dimensional continuous representations. Such representations are crucial in sentence clustering, where traditional word co-occurrence representations often perform poorly on semantically similar sentences that share no common words. This article presents a new implementation that incorporates a sentence similarity measure based on embedding representations and uses it to evaluate three types of text clustering methods (partitional, hierarchical, and fuzzy clustering) on standard textual datasets. The measure derives its semantic information from pre-trained models designed to simulate human knowledge about words in natural language. The article also compares the performance of the similarity measure when trained on two state-of-the-art pre-trained models to investigate which yields better results. We argue that the superior performance of the selected clustering methods stems from their more effective use of the semantic information offered by this embedding-based similarity measure. Furthermore, we apply hierarchical clustering, the best-performing method, to a text summarization task and report the results. The implementation demonstrates that incorporating the sentence embedding measure leads to significantly improved performance in both text clustering and text summarization.
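To make the central idea concrete, the following is a minimal sketch, not the article's implementation. It assumes sentence embeddings are formed by averaging word vectors and compared with cosine similarity, then grouped by average-linkage agglomerative (hierarchical) clustering. The table `TOY_VECTORS` and the helpers `embed`, `cosine`, and `agglomerative` are hypothetical names with hand-made values chosen for illustration; a real system would load vectors from a pre-trained model instead.

```python
import math

# Hypothetical hand-made word vectors (illustration only); a real system
# would take these from a pre-trained embedding model.
TOY_VECTORS = {
    "doctor": [0.90, 0.10, 0.00],
    "physician": [0.85, 0.15, 0.00],
    "treats": [0.70, 0.20, 0.10],
    "cures": [0.75, 0.15, 0.10],
    "patients": [0.80, 0.10, 0.10],
    "illness": [0.80, 0.20, 0.00],
    "stocks": [0.00, 0.90, 0.10],
    "shares": [0.05, 0.85, 0.10],
    "fell": [0.10, 0.70, 0.20],
    "dropped": [0.10, 0.75, 0.15],
}


def embed(sentence):
    """Average the vectors of known words (a simple baseline embedding)."""
    vecs = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vecs:
        raise ValueError(f"no known words in: {sentence!r}")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def agglomerative(embeddings, n_clusters):
    """Average-linkage agglomerative clustering on cosine similarity."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > n_clusters:
        best = None  # (similarity, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [
                    cosine(embeddings[i], embeddings[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Sentences 0 and 1 share no words, as do sentences 2 and 3.
sentences = [
    "doctor treats patients",
    "physician cures illness",
    "stocks fell",
    "shares dropped",
]
embeddings = [embed(s) for s in sentences]
clusters = agglomerative(embeddings, n_clusters=2)
print(clusters)
```

With these toy vectors, the two medical sentences end up in one cluster and the two financial sentences in the other, even though neither pair shares a word. A bag-of-words representation would give each pair a similarity of zero, which is the failure case the abstract describes.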

Short text clustering based on word embeddings and EMD

Improving Short Text Classification Through Better Feature Space Selection

An Unsupervised Learning Short Text Clustering Method

A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic

A Linguistic Feature Based Text Clustering Method

A New Text Clustering Method Using Hidden Markov Model

Enhancing Web Text Clustering Accuracy and Efficiency With a Maximum Entropy Function Model: Overcoming High-Dimensional and Directional Challenges

X-DMM: Fast and Scalable Model Based Text Clustering

Improving Medical Short Text Classification with Semantic Expansion Using Word-Cluster Embedding

Text clustering based on pre-trained models and autoencoders

Feature Dimension Reduction Short Text Clustering Combined with Semantic and Statistics

A Short Text Clustering Approaches in Social Media

Experimental study on short-text clustering using transformer-based semantic similarity measure

Subspace Clustering by Directly Solving Discriminative K-means

Grouped Text Clustering Using Non-Parametric Gaussian Mixture Experts

Research on a Text Data Preprocessing Method Suitable for Clustering Algorithm

Representation Learning for Short Text Clustering

Clustering Text Data Streams

Web Service Clustering Method Based on Word Vector and Biterm Topic Model

A Clustering Algorithm for Short Documents Based On Concept Similarity