Research on Feature Selection and Knn Classification Method in Chinese Text Classification

Xiao Chao, Wu Ping
DOI: https://doi.org/10.2991/nceece-15.2016.172
2016-01-01
Abstract: Scholars at home and abroad have done extensive research on feature selection methods for Chinese text classification, such as document frequency (DF), information gain (IG), and the chi-square test (CHI). Building on their work, we propose a new feature selection method that measures the degree of imbalance in a term's distribution, compare it with other feature selection methods using the k-nearest-neighbor (kNN) algorithm, and find that the new method performs as well as CHI and IG. Experiments show that, whichever feature selection method is chosen, once the number of features reaches a certain value the gain in classification accuracy becomes very slight. Continuing to increase the feature dimension hardly improves classification performance, while the time consumed doubles. We therefore attempt to improve the kNN method by computing text similarity differently. The improved method represents each feature's weight as a bit string and computes the similarity of two documents from their bit patterns, which markedly reduces both the space required to store documents and the time needed to compute their similarity. Experiments confirm that the new kNN method greatly speeds up classification at the cost of a small loss in classification accuracy.
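The bit-string idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the encoding, so everything here is an assumption made for illustration: each feature weight is reduced to a single bit (set when the weight exceeds a hypothetical `threshold`), similarity is the count of features set in both documents, and the helper names `to_bits`, `bit_similarity`, and `knn_classify` are invented. The paper's actual quantization and similarity measure may differ.

```python
from collections import Counter


def to_bits(weights, threshold=0.0):
    """Pack a weight vector into one int; bit i is set when weights[i] > threshold.

    Assumption: one bit per feature. The paper may use more bits per weight.
    """
    bits = 0
    for i, w in enumerate(weights):
        if w > threshold:
            bits |= 1 << i
    return bits


def bit_similarity(a, b):
    """Count the features present in both documents (popcount of a AND b)."""
    return bin(a & b).count("1")


def knn_classify(query_bits, train, k=3):
    """Majority vote among the k training documents most similar to the query.

    train is a list of (bits, label) pairs.
    """
    ranked = sorted(
        train,
        key=lambda item: bit_similarity(query_bits, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: four features, two categories (values are made up).
train = [
    (to_bits([0.5, 0.3, 0.0, 0.0]), "sports"),
    (to_bits([0.2, 0.0, 0.1, 0.0]), "sports"),
    (to_bits([0.0, 0.0, 0.4, 0.6]), "finance"),
    (to_bits([0.0, 0.1, 0.0, 0.7]), "finance"),
]
query = to_bits([0.4, 0.2, 0.0, 0.0])
predicted = knn_classify(query, train, k=3)
```

Under this encoding a document occupies one bit per feature rather than one floating-point weight, and a comparison is a single AND plus a bit count, which is consistent with the space and time savings the abstract reports. Discarding the weight magnitudes is also a plausible source of the small accuracy loss.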