Raising High-Degree Overlapped Character Bigrams Into Trigrams For Dimensionality Reduction In Chinese Text Categorization

Dj Xue, Ms Sun
DOI: https://doi.org/10.1007/978-3-540-24630-5_72
2004-01-01
Abstract: The high dimensionality of the feature space is a crucial obstacle for Automated Text Categorization. Based on the characteristics of Chinese character N-grams, this paper reveals a kind of redundancy that arises from feature overlapping. Focusing on Chinese character bigrams, the paper puts forward the concept of delta-overlapping between two bigrams and proposes a new method of dimensionality reduction, called delta-Overlapped Raising (delta-OR), which raises delta-overlapped bigrams into their corresponding trigrams. Moreover, the paper designs a two-stage dimensionality reduction strategy for Chinese bigrams by integrating a filtering method based on the Chi-CIG score function with the delta-OR method. Experimental results on a large-scale Chinese document collection indicate that, on the basis of the first stage of reduction processing, delta-OR at the second stage can significantly reduce the dimension of the feature space without sacrificing categorization effectiveness. We believe that the above methodology would be language-independent.
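The raising idea can be illustrated with a minimal sketch. The abstract does not give the formal definition of delta-overlapping, so the rule used here is an assumption made only for illustration: two bigrams c1c2 and c2c3 are treated as delta-overlapped when each of their counts exceeds the count of the trigram c1c2c3 by at most delta. The function names `char_ngrams` and `delta_overlapped_raise` are hypothetical, and the first-stage Chi-CIG filtering is not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def delta_overlapped_raise(text, delta=0):
    """Replace delta-overlapped bigram pairs with their trigram.

    ASSUMPTION (not taken from the paper): bigrams c1c2 and c2c3 are
    delta-overlapped when count(c1c2) - count(c1c2c3) <= delta and
    count(c2c3) - count(c1c2c3) <= delta.
    """
    bigrams = char_ngrams(text, 2)
    trigrams = char_ngrams(text, 3)
    features = Counter(bigrams)  # start from the plain bigram features

    for tri, tri_count in trigrams.items():
        left, right = tri[:2], tri[1:]
        if left == right:
            continue  # same bigram twice (e.g. "aaa"), not a pair
        # Thresholds are checked against the original bigram counts.
        if (bigrams[left] - tri_count <= delta
                and bigrams[right] - tri_count <= delta):
            features.pop(left, None)
            features.pop(right, None)
            features[tri] = tri_count
    return features


features = delta_overlapped_raise("北京大学北京人", delta=0)
for feature, count in sorted(features.items()):
    print(feature, count)
```

In this toy string the bigram 北京 occurs twice but each trigram containing it occurs only once, so under the assumed rule it stays a bigram, while 京大 and 大学 always co-occur and are raised into 京大学. The feature count drops from five bigrams to four mixed features.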