Abstract: Text clustering techniques are commonly used to organize text documents into topic-related groups, which helps users gain a comprehensive understanding of a corpus or of the results returned by an information retrieval system. Most existing text clustering algorithms are derived from traditional clustering of formatted data, rely heavily on term analysis methods, and adopt the Vector Space Model (VSM) as their document representation. However, because of intrinsic characteristics of text such as its high-dimensional feature vector space, sparseness has a strong impact on the clustering algorithm. Feature reduction is therefore an important preprocessing step that improves the efficiency and accuracy of clustering by removing redundant and irrelevant terms from the corpus. Although clustering is considered an unsupervised learning method, text still carries some prior knowledge that can be obtained through NLP-based analysis. In this paper, we propose a semantic-analysis-based feature reduction method for Chinese text clustering. Our method is based on a dedicated selection of Part-of-Speech tags and on synonym consolidation. It reduces the feature space of documents more effectively than the traditional feature reduction methods of TF-IDF and stopword removal, while preserving or sometimes even improving the accuracy of the clustering algorithm. In our experiments, we tested the feature reduction method with the bisecting k-means algorithm, which has been shown to be efficient for text clustering. The results show that our method reduces the feature space significantly while achieving better clustering accuracy in terms of purity.
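As a rough illustration of the two ideas named in the abstract, the following minimal sketch filters terms by Part-of-Speech tag, merges synonyms into one canonical term, and computes cluster purity. It is not the authors' implementation: the tag set `KEEP_TAGS`, the synonym dictionary `SYNONYMS`, the function names, and the toy document are all assumptions made for this example, and the input is assumed to be already segmented and POS-tagged.

```python
from collections import Counter

# Hypothetical set of POS tags to keep (nouns and verbs here).
# The abstract does not list the tags the authors actually select.
KEEP_TAGS = {"n", "v"}

# Hypothetical synonym dictionary: each term maps to a canonical term.
SYNONYMS = {"电脑": "计算机", "微机": "计算机"}


def reduce_features(tagged_doc, keep_tags=KEEP_TAGS, synonyms=SYNONYMS):
    """Return term counts after POS filtering and synonym consolidation.

    tagged_doc is a list of (term, pos_tag) pairs from any segmenter/tagger.
    """
    kept = (
        synonyms.get(term, term)        # merge synonyms into one feature
        for term, tag in tagged_doc
        if tag in keep_tags             # drop terms with unselected POS tags
    )
    return Counter(kept)


def purity(cluster_labels, class_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(class_labels)


# Toy tagged document: six tokens reduce to two distinct features.
doc = [("电脑", "n"), ("的", "u"), ("计算机", "n"),
       ("运行", "v"), ("很", "d"), ("快", "a")]
features = reduce_features(doc)
print(features)

# Toy clustering result: 4 of 5 documents match their cluster's majority class.
score = purity([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
print(score)
```

In this toy case the six tokens collapse to two features, which is the kind of dimensionality reduction the abstract describes. The reduced counts could then be weighted and passed to any clustering algorithm, such as bisecting k-means.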
