Non-Independent Term Selection for Chinese Text Categorization

Li Jingyang, Sun Maosong
DOI: https://doi.org/10.1016/s1007-0214(09)70016-1
2009-01-01
Tsinghua Science & Technology
Abstract: Chinese text categorization differs from English text categorization in its much larger term set (of words or character n-grams), which makes modern high-performance classifiers very slow to train and to apply. This study assumes that this high-dimensionality problem is related to redundancy in the term set, which traditional term selection methods cannot resolve. A greedy algorithm framework named "non-independent term selection" is presented, which reduces the redundancy according to string-level correlations. Several preliminary implementations of this idea are demonstrated. Experimental results show that a good tradeoff can be reached between classification performance and the size of the term set.
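The abstract does not specify the algorithm, so the following is only a minimal sketch of what a greedy, redundancy-aware term selection could look like, not the authors' implementation. It assumes that each candidate term already has a relevance score from some conventional measure, that "string-level correlation" can be approximated by one term being a substring of another, and that redundancy is judged by the Jaccard overlap of the document sets in which the two terms occur. The function name, the parameters, and the toy data are all hypothetical.

```python
def greedy_non_independent_selection(scores, doc_sets, k, overlap_threshold=0.9):
    """Greedily pick up to k terms, skipping string-level redundant ones.

    scores:   dict mapping term -> relevance score (assumed precomputed)
    doc_sets: dict mapping term -> set of ids of documents containing the term
    A candidate is treated as redundant (an assumption of this sketch) when it
    is a substring or superstring of an already selected term AND the two
    terms occur in nearly the same documents (Jaccard >= overlap_threshold).
    """
    selected = []
    # Visit candidates from highest to lowest score; ties broken by term text.
    for term in sorted(scores, key=lambda t: (-scores[t], t)):
        if len(selected) >= k:
            break
        redundant = False
        for kept in selected:
            if term in kept or kept in term:  # string-level relation
                a, b = doc_sets[term], doc_sets[kept]
                union = a | b
                jaccard = len(a & b) / len(union) if union else 0.0
                if jaccard >= overlap_threshold:
                    redundant = True
                    break
        if not redundant:
            selected.append(term)
    return selected


# Toy data (hypothetical): overlapping character n-grams of one phrase.
scores = {"清华大学": 0.90, "清华大": 0.85, "清华": 0.80, "大学": 0.30}
doc_sets = {
    "清华大学": {1, 2, 3},
    "清华大": {1, 2, 3},
    "清华": {1, 2, 3, 4},
    "大学": {1, 2, 3, 5, 6},
}
selected = greedy_non_independent_selection(scores, doc_sets, k=3, overlap_threshold=0.7)
print(selected)
```

In this toy run, a purely score-based selection would keep the three highest-scoring n-grams, which all carry nearly the same information. The sketch instead drops the two n-grams that are substrings of an already selected term and occur in almost the same documents, which is the kind of tradeoff between term-set size and retained information that the abstract describes.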