Abstract: Text clustering techniques are commonly used to organize text documents into topic-related groups, which helps users gain a comprehensive understanding of a corpus or of the results returned by an information retrieval system. Most existing text clustering algorithms, which derive from clustering methods for traditional formatted data, rely heavily on term analysis and adopt the Vector Space Model (VSM) as their document representation. However, because text inherently yields a high-dimensional feature vector space, sparseness has a strong impact on the clustering algorithm. Feature reduction is therefore an important preprocessing step that improves the efficiency and accuracy of clustering by removing redundant and irrelevant terms from the corpus. Although clustering is considered an unsupervised learning method, text still offers some prior knowledge that can be exploited through NLP-based analysis. In this paper, we propose a semantic-analysis-based feature reduction method for Chinese text clustering. Our method is based on a dedicated selection of Part-of-Speech tags and on synonym consolidation. Compared with the traditional feature reduction methods of tf-idf and stopword removal, it reduces the feature space of documents more effectively while preserving, and sometimes even improving, the accuracy of the clustering algorithm. In our experiments, we tested our feature reduction method with the bisecting k-means algorithm, which has been shown to be efficient for text clustering. The results show that our method reduces the feature space significantly while achieving better clustering accuracy in terms of purity.
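
As a rough illustration of the two reduction steps named in the abstract (Part-of-Speech tag selection and synonym consolidation) and of the purity measure, the following minimal Python sketch may help. It is not the authors' implementation: the kept tag set, the toy synonym dictionary, and the function names `reduce_features` and `purity` are all assumptions made for illustration, and the input is assumed to be already segmented and POS-tagged. Purity is computed under its common definition, the fraction of documents that belong to the majority class of their cluster.

```python
from collections import Counter

# Assumption: which POS tags to keep. The abstract does not list the
# tags actually selected, so nouns ("n") and verbs ("v") are a guess.
KEPT_TAGS = {"n", "v"}

# Assumption: a toy synonym dictionary mapping each word to one
# canonical representative. A real system would load a thesaurus.
SYNONYMS = {"电脑": "计算机"}


def reduce_features(tagged_tokens, kept_tags=KEPT_TAGS, synonyms=SYNONYMS):
    """Keep tokens whose POS tag is selected, then merge synonyms."""
    return [
        synonyms.get(word, word)      # synonym consolidation
        for word, tag in tagged_tokens
        if tag in kept_tags           # POS tag selection
    ]


def purity(cluster_labels, class_labels):
    """Fraction of documents in the majority class of their cluster."""
    if not class_labels:
        raise ValueError("purity is undefined for an empty collection")
    per_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical pre-tagged document: (word, tag) pairs.
doc = [("电脑", "n"), ("的", "u"), ("计算机", "n"),
       ("运行", "v"), ("很", "d"), ("快", "a")]
reduced = reduce_features(doc)
vocabulary_before = {word for word, _ in doc}
vocabulary_after = set(reduced)
```

In this toy document the vocabulary shrinks from six distinct terms to two, because function words are dropped by the tag filter and the two words for "computer" collapse into one feature. The reduced term lists would then be vectorized and passed to a clustering algorithm such as bisecting k-means, and `purity` could be applied to the resulting cluster assignments against known class labels.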

A Statistics-Based Semantic Relation Analysis Approach For Document Clustering

Document Clustering Using Locality Preserving Indexing

A link-based approach to semantic relation analysis

Coupled Term-Term Relation Analysis for Document Clustering

Semantic document clustering based on ontology

A Semantic Approach for Text Clustering Using WordNet and Lexical Chains

Joint Probability Consistent Relation Analysis for Document Representation

Semantic Feature Reduction in Chinese Document Clustering

Document Clustering Based on Semantic Smoothing Approach

Semantic Correlation Network Based Text Clustering

A Semantic approach for effective document clustering using WordNet

Unsupervised Learning of Semantic Representation for Documents with the Law of Total Probability

Document Clustering Based on Word Sense Cluster

A spectral analysis approach to document summarization: Clustering and ranking sentences simultaneously

Clustering articles based on semantic similarity

Semantic Smoothing for Model-based Document Clustering

Clustering Technology for High Dimensional Data Based on Semantics

Text Clustering Via Term Semantic Units

Towards Semantically Sensitive Text Clustering: a Feature Space Modeling Technology Based on Dimension Extension

Semantic smoothing of document models for agglomerative clustering

An Approach of Latent Semantic Space Partition and Web Document Clustering