Abstract: Text clustering techniques are commonly used to organize text documents into topic-related groups, which helps users gain a comprehensive understanding of a corpus or of the results returned by an information retrieval system. Most existing text clustering algorithms are derived from traditional clustering of formatted data; they rely heavily on term analysis methods and adopt the Vector Space Model (VSM) as their document representation. Because text has inherent characteristics such as a high-dimensional feature vector space, sparseness has a strong impact on the clustering algorithm. Feature reduction is therefore an important preprocessing step that improves the efficiency and accuracy of clustering by removing redundant and irrelevant terms from the corpus. Although clustering is considered an unsupervised learning method, text still offers some prior knowledge that can be exploited through NLP-based analysis. In this paper, we propose a semantic-analysis-based feature reduction method for Chinese text clustering. Our method is based on a dedicated selection of Part-of-Speech tags and on synonym consolidation. It reduces the feature space of documents more effectively than the traditional feature reduction methods TF-IDF and stopword removal, while preserving or sometimes even improving the accuracy of the clustering algorithm. In our experiments, we tested our feature reduction method with the bisecting k-means algorithm, which has been shown to be efficient for text clustering. The results show that our method reduces the feature space significantly while achieving better clustering accuracy in terms of purity.
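The two reduction steps named in the abstract (Part-of-Speech tag selection and synonym consolidation) and the purity measure can be illustrated with a minimal sketch. This is not the authors' implementation: the set of kept tags, the tag labels, the toy synonym map, and the function names `reduce_features` and `purity` are all assumptions made for illustration, and the input is assumed to be already segmented and tagged.

```python
from collections import Counter

# Assumed for illustration: keep only nouns ("n") and verbs ("v").
# The abstract does not state which tags the method actually selects.
KEEP_TAGS = {"n", "v"}

# Hypothetical synonym map: each variant points to one canonical term.
SYNONYMS = {"电脑": "计算机", "微机": "计算机"}


def reduce_features(tagged_doc):
    """Return term counts after POS filtering and synonym consolidation.

    tagged_doc is a list of (word, pos_tag) pairs.
    """
    kept = [word for word, tag in tagged_doc if tag in KEEP_TAGS]
    return Counter(SYNONYMS.get(word, word) for word in kept)


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(class_labels)


# Toy document: six tokens shrink to two features.
doc = [("电脑", "n"), ("的", "u"), ("计算机", "n"),
       ("运行", "v"), ("很", "d"), ("快", "a")]
features = reduce_features(doc)

# Toy clustering result: two clusters over five labeled documents.
score = purity([0, 0, 1, 1, 1], ["a", "a", "a", "b", "b"])
```

In this toy case the function word, adverb, and adjective are dropped and the two synonymous nouns collapse into a single dimension, which is the kind of feature-space reduction the abstract describes. The reduced counts could then be weighted and passed to any clustering algorithm, such as bisecting k-means.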

Subspace clustering of text documents with feature weighting k-means algorithm

A feature group weighting method for subspace clustering of high-dimensional data

CWC: A Clustering-Based Feature Weighting Approach for Text Classification

A Linguistic Feature Based Text Clustering Method

Subspace Clustering by Directly Solving Discriminative K-means

An Improved K-means Algorithm for Document Clustering

Document Clustering Using Sample Weighting

A Novel Rough Semi-Supervised K-Means Algorithm for Text Clustering

Data Clustering Method with Feature Semantic Weight

Sparse Poisson coding for high dimensional document clustering

Subspace Clustering of Very Sparse High-Dimensional Data

A New Partitioning Based Algorithm for Document Clustering

Clustering Algorithm on Block Division of Documents

A Feature Value Weighted Method Based on Paragraph Co-occurrence Frequency

A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic

Towards Semantically Sensitive Text Clustering: a Feature Space Modeling Technology Based on Dimension Extension

The heavy frequency vector-based text clustering

A Fuzzy K-modes-based Algorithm for Soft Subspace Clustering

Semantic Feature Reduction in Chinese Document Clustering

Text clustering based on term weights automatic partition

Exploiting Word Cluster Information for Unsupervised Feature Selection