Select Strong Information Features to Improve Text Categorization Effectiveness

Xue, Dejun; Sun, Maosong
DOI: https://doi.org/10.1515/jisys.2004.13.4.275
2004-01-01
Journal of Intelligent Systems
Abstract: High feature dimensionality is one of the main obstacles in Text Categorization (TC). This paper focused on feature selection as a way to reduce feature dimensionality in TC. We first classified the original features into three types according to their contribution to categorization: strong information features, weak information features, and irrelevant features. We then put forward the Constrained Information Gain (CIG) measure, which favored low-frequency informative features by ignoring the negative evidence used in the classic Information Gain (IG) measure. Concentrating on the first type of feature, we further proposed a novel feature selection measure, Chi-CIG, which combined the Chi and CIG measures. A TC system for Chinese documents was designed, based on a class-centroid classifier and Chinese character bigram features. Experimental results on a large-scale document collection (71,674 documents) indicated that the Chi-CIG measure produced a more effective feature set for categorization than the classic Chi and IG measures did.
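As a rough illustration of the measures named in the abstract, the Python sketch below scores one term against one class from a 2x2 document-count table. The chi-square and information gain formulas are the standard textbook ones. The function `constrained_information_gain` is only an assumption about what "ignoring negative evidence" might mean: it keeps the IG terms where the feature is present and drops those where it is absent. The abstract does not give the paper's actual CIG definition, nor how Chi and CIG are combined into Chi-CIG, so no combination is attempted here. All function and parameter names are hypothetical.

```python
import math


def _mi_term(joint, p_x, p_y):
    """One summand of mutual information: P(x,y) * log2(P(x,y) / (P(x) P(y)))."""
    if joint == 0.0:
        return 0.0  # convention: 0 * log(0) = 0
    return joint * math.log2(joint / (p_x * p_y))


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term/class 2x2 table.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    d: docs outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def information_gain(a, b, c, d):
    """Classic IG for one class vs. the rest: sums over term present AND absent."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    p_t, p_not_t = (a + b) / n, (c + d) / n
    p_c, p_not_c = (a + c) / n, (b + d) / n
    return (
        _mi_term(a / n, p_t, p_c)
        + _mi_term(b / n, p_t, p_not_c)
        + _mi_term(c / n, p_not_t, p_c)
        + _mi_term(d / n, p_not_t, p_not_c)
    )


def constrained_information_gain(a, b, c, d):
    """ASSUMED reading of CIG: keep only the term-present summands of IG.

    This is an illustrative guess, not the definition from the paper.
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    p_t = (a + b) / n
    p_c, p_not_c = (a + c) / n, (b + d) / n
    return _mi_term(a / n, p_t, p_c) + _mi_term(b / n, p_t, p_not_c)


if __name__ == "__main__":
    # A term that appears in every document of the class and nowhere else.
    table = (10, 0, 0, 10)
    print(chi_square(*table))
    print(information_gain(*table))
    print(constrained_information_gain(*table))
```

Under this assumed reading, a term's absence from documents contributes nothing to its score, which is one plausible way a measure could stop rewarding high-frequency terms for the evidence carried by their absence.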