Comparing different text representation and feature selection methods on Chinese text classification using Character n-grams

Zhihua Wei, Jean-Hugues Chauchat, Duoqian Miao
2008-01-01
Abstract: In this paper, we perform Chinese text categorization using n-gram text representation on TanCorpV1.0, a new corpus designed for Chinese text classification that contains more than 14,000 texts divided into 12 classes. We use a combination of methods, including inter-class feature reduction methods and cross-class feature selection methods. We use the C-SVC classifier (with a linear kernel), the SVM algorithm designed for multi-class classification. We perform our experiments on the TANAGRA platform. Our experiments concern: (1) the performance comparison between using 1- and 2-grams and using 1-, 2-, and 3-grams in Chinese text representation; (2) the performance comparison between different feature representations: absolute text frequency, relative text frequency, absolute n-gram frequency, and relative n-gram frequency; (3) the comparison of the sparseness of the "text*feature" matrix between using n-gram frequency and text frequency in feature selection; (4) the performance comparison between two text coding methods: the 0/1 logical value and the n-gram frequency numeric value. We found that when fewer than 3,000 features are used, the feature selection methods based on n-gram frequency (absolute or relative) always yield better results.
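The representation choices compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`char_ngrams`, `frequency_tables`, `encode`) and the two-text toy corpus are hypothetical. It also assumes that "absolute n-gram frequency" means the total number of occurrences in the corpus and "absolute text frequency" means the number of texts containing the n-gram. The relative variants are not shown because the abstract does not define their normalization.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2)):
    """Return the character n-grams of the requested lengths, whitespace removed."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in n_values:
        grams.extend("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))
    return grams


def frequency_tables(texts, n_values=(1, 2)):
    """Compute corpus-level counts under the assumed definitions above.

    ngram_freq: total occurrences of each n-gram over all texts.
    text_freq:  number of texts in which each n-gram appears at least once.
    per_text:   one Counter of n-gram occurrences per text.
    """
    ngram_freq, text_freq, per_text = Counter(), Counter(), []
    for text in texts:
        counts = Counter(char_ngrams(text, n_values))
        per_text.append(counts)
        ngram_freq.update(counts)         # adds occurrence counts
        text_freq.update(counts.keys())   # adds 1 per text containing the n-gram
    return ngram_freq, text_freq, per_text


def encode(per_text, features, binary=False):
    """Build the text*feature matrix with 0/1 coding or n-gram frequency coding."""
    return [
        [(1 if counts[f] else 0) if binary else counts[f] for f in features]
        for counts in per_text
    ]


# Toy corpus (hypothetical, for illustration only).
texts = ["中国中国", "中文"]
ngram_freq, text_freq, per_text = frequency_tables(texts)
features = ["中", "中国"]
frequency_matrix = encode(per_text, features)
binary_matrix = encode(per_text, features, binary=True)
```

In this toy corpus the unigram "中" occurs three times but appears in only two texts, so ranking features by n-gram frequency and ranking them by text frequency can select different feature sets, which is the contrast the experiments examine.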