Abstract: Text clustering techniques are commonly used to organize text documents into topic-related groups, which helps users gain a comprehensive understanding of a corpus or of the results returned by an information retrieval system. Most existing text clustering algorithms are derived from traditional clustering of formatted data; they rely heavily on term analysis and adopt the Vector Space Model (VSM) as their document representation. Because text is inherently represented in a high-dimensional feature space, sparseness has a strong impact on the clustering algorithm. Feature reduction, which removes redundant and irrelevant terms from the corpus, is therefore an important preprocessing step for improving the efficiency and accuracy of clustering. Although clustering is considered an unsupervised learning method, text still carries prior knowledge that can be obtained through NLP-based analysis. In this paper, we propose a semantic-analysis-based feature reduction method for Chinese text clustering. Our method is based on a dedicated selection of Part-of-Speech tags and on synonym consolidation. It reduces the feature space of documents more effectively than the traditional feature reduction methods TF-IDF and stopword removal, while preserving or sometimes even improving the accuracy of the clustering algorithm. In our experiments, we tested the method with the bisecting k-means algorithm, which has been shown to be efficient for text clustering. The results show that our method reduces the feature space significantly while achieving better clustering accuracy in terms of purity.
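As a rough illustration of the two ideas named in the abstract, the sketch below filters tokens by Part-of-Speech tag, merges synonyms into one feature, and computes cluster purity. It is not the authors' implementation. The tag set `KEEP_TAGS`, the synonym table `SYNONYMS`, and the function names `reduce_features` and `purity` are hypothetical, and the input is assumed to be already segmented and POS-tagged by an external tool.

```python
from collections import Counter

# Assumption: only nouns ("n") and verbs ("v") are kept. The abstract does
# not list the tags the authors actually select.
KEEP_TAGS = {"n", "v"}

# Assumption: a toy synonym table mapping each word to one canonical form.
SYNONYMS = {"计算机": "电脑"}  # both words mean "computer"


def reduce_features(tagged_tokens, keep_tags=KEEP_TAGS, synonyms=SYNONYMS):
    """Return term counts after POS filtering and synonym consolidation."""
    counts = Counter()
    for word, tag in tagged_tokens:
        if tag not in keep_tags:
            continue  # drop terms whose POS tag is not selected
        counts[synonyms.get(word, word)] += 1  # merge synonyms into one feature
    return counts


def purity(cluster_ids, class_labels):
    """Fraction of documents that carry the majority label of their cluster."""
    if not class_labels or len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(class_labels)


if __name__ == "__main__":
    # Hypothetical pre-tagged document: (word, tag) pairs.
    doc = [("电脑", "n"), ("的", "u"), ("计算机", "n"), ("运行", "v")]
    print(reduce_features(doc))
    print(purity([0, 0, 1, 1], ["a", "a", "a", "b"]))
```

The reduced counts could then be turned into VSM vectors and passed to a clustering algorithm such as bisecting k-means; that step is left out here because the abstract gives no details on it.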

Semantic Feature Reduction in Chinese Document Clustering
