An Improved Feature Selection Algorithm Utilizing the Within Category Variance

P. J. Zhang, S. C. Gan
DOI: https://doi.org/10.2991/eame-15.2015.217
2015-01-01
Abstract: The χ² statistic is a commonly used and effective feature selection method for text corpora. However, it has several deficiencies. First, it counts only the document frequency of each feature. Second, it does not distinguish among features that have different frequency distributions within a category. To overcome these shortcomings, two indexes, namely the within-category frequency and the within-category variance, are introduced. Experiments compare the traditional χ² statistic, some existing improvements, and the improved χ² statistic proposed in this paper, using either a naive Bayes or an SVM classifier on the corpora collected by Fudan University and Sogou. The experimental results show that the proposed improvement is effective and robust across the classifiers and corpora tested.
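The first deficiency named in the abstract can be illustrated with a minimal sketch of the conventional χ² score computed from a 2×2 table of document counts. This is the textbook formulation, not the improved statistic of the paper, whose definitions of within-category frequency and within-category variance are not given in the abstract. The function names, the toy labels, and the term counts below are hypothetical and chosen only for illustration.

```python
def chi_square(a, b, c, d):
    """Conventional chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or category carries no signal
    return n * (a * d - c * b) ** 2 / denom


def contingency(term_counts, labels, category):
    """Build the 2x2 table; only presence (count > 0) is used, not the count itself."""
    a = b = c = d = 0
    for count, label in zip(term_counts, labels):
        present = count > 0
        in_category = label == category
        if present and in_category:
            a += 1
        elif present:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical toy corpus: per-document occurrence counts of two terms.
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]
even_term = [2, 2, 2, 0, 0, 1]    # spread evenly inside "sports"
bursty_term = [9, 1, 1, 0, 0, 1]  # concentrated in one "sports" document

score_even = chi_square(*contingency(even_term, labels, "sports"))
score_bursty = chi_square(*contingency(bursty_term, labels, "sports"))
print(score_even, score_bursty)
```

Both terms appear in the same documents, so they receive the same score even though their frequency distributions inside the category differ. That is the gap the two proposed indexes are meant to address.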