A General Framework of Feature Selection for Text Categorization

Hongfang Jing, Bin Wang, Yahui Yang, Yan Xu
DOI: https://doi.org/10.1007/978-3-642-03070-3_49
2009-01-01
Abstract: Many feature selection methods have been proposed for text categorization. However, their performance is usually verified by experiments, so the results depend on the corpora used and may not be accurate. This paper proposes a novel feature selection framework called Distribution-Based Feature Selection (DBFS), based on the distribution difference of features. This framework generalizes most state-of-the-art feature selection methods, including OCFS, MI, ECE, IG, CHI and OR. The performance of many feature selection methods can be estimated by theoretical analysis using components of this framework. Besides, DBFS sheds light on the merits and drawbacks of many existing feature selection methods. In addition, this framework helps to select suitable feature selection methods for specific domains. Moreover, a weighted model based on DBFS is given so that suitable feature selection methods for unbalanced datasets can be derived. The experimental results show that the derived methods are more effective than CHI, IG and OCFS on both balanced and unbalanced datasets.