Distributional Character Clustering For Chinese Text Categorization

Xz Zhou, Zh Wu
DOI: https://doi.org/10.1007/978-3-540-28633-2_61
2004-01-01
Abstract: A novel feature generation method for Chinese text categorization, distributional character clustering, which avoids word segmentation, is presented and experimentally evaluated. We propose a hybrid clustering criterion function and a bisecting divisive clustering algorithm to improve the quality of the clusters. The experimental results show that distributional character clustering is an effective dimensionality reduction method: it reduces the feature space to very low dimensionality (e.g. 500 features) while maintaining high performance, and it performs much better than information gain. Moreover, a Naive Bayes classifier with distributional character clustering achieves state-of-the-art performance in Chinese text classification.
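The following is a minimal sketch of the general idea, not the authors' implementation. It assumes each character is described by a smoothed class distribution P(class | character), and that clusters are split by a two-way k-means using KL divergence to a weighted centroid. The abstract does not specify the paper's hybrid criterion function, so plain weighted KL scatter stands in for it here. All function names, the smoothing constant `alpha`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def char_class_distributions(docs, classes, alpha=1.0):
    """Map each character to (weight, smoothed P(class | char)). No word segmentation."""
    counts = defaultdict(Counter)
    for text, label in docs:
        for ch in text:
            if not ch.isspace():
                counts[ch][label] += 1
    total = sum(sum(c.values()) for c in counts.values())
    dists = {}
    for ch, c in counts.items():
        n = sum(c.values())
        probs = [(c[k] + alpha) / (n + alpha * len(classes)) for k in classes]
        dists[ch] = (n / total, probs)
    return dists

def kl(p, q):
    """KL divergence D(p || q); q is strictly positive thanks to smoothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def centroid(chars, dists):
    """Frequency-weighted mean of the members' class distributions."""
    w = sum(dists[c][0] for c in chars)
    dim = len(dists[chars[0]][1])
    return [sum(dists[c][0] * dists[c][1][i] for c in chars) / w for i in range(dim)]

def scatter(chars, dists):
    """Assumed criterion: weighted KL divergence of members to their centroid."""
    m = centroid(chars, dists)
    return sum(dists[c][0] * kl(dists[c][1], m) for c in chars)

def bisect(chars, dists, iters=20):
    """Split one cluster in two; returns ([], []) if no valid split exists."""
    m = centroid(chars, dists)
    a = max(chars, key=lambda c: kl(dists[c][1], m))            # farthest from centroid
    b = max(chars, key=lambda c: kl(dists[c][1], dists[a][1]))  # farthest from seed a
    ca, cb = dists[a][1], dists[b][1]
    best = ([], [])
    for _ in range(iters):
        left = [c for c in chars if kl(dists[c][1], ca) <= kl(dists[c][1], cb)]
        members = set(left)
        right = [c for c in chars if c not in members]
        if not left or not right or (left, right) == best:
            break
        best = (left, right)
        ca, cb = centroid(left, dists), centroid(right, dists)
    return best

def cluster_characters(docs, classes, k):
    """Bisecting divisive clustering: repeatedly split the highest-scatter cluster."""
    dists = char_class_distributions(docs, classes)
    clusters = [sorted(dists)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        target = max(splittable, key=lambda c: scatter(c, dists))
        left, right = bisect(target, dists)
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters

def cluster_features(text, clusters):
    """Represent a document as per-cluster character counts."""
    index = {ch: i for i, cl in enumerate(clusters) for ch in cl}
    counts = [0] * len(clusters)
    for ch in text:
        if ch in index:
            counts[index[ch]] += 1
    return counts

# Toy corpus (hypothetical): two classes, character-level input.
docs = [("球赛球队", "sports"), ("球员比赛", "sports"),
        ("股票股市", "finance"), ("股价市场", "finance")]
clusters = cluster_characters(docs, ["sports", "finance"], k=2)
features = cluster_features("球队股", clusters)
```

In this sketch the per-cluster counts returned by `cluster_features` would replace individual character counts as the input to a classifier such as Naive Bayes, which is how a vocabulary of thousands of characters could be reduced to a few hundred features.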