Research on the Feature Selection Techniques Used in Text Classification.

Yan Li, Chungang Chen
DOI: https://doi.org/10.1109/fskd.2012.6234223
2012-01-01
Abstract: With the ever-increasing number of digital documents, the ability to classify them automatically, quickly, and accurately is becoming both more critical and more difficult. This paper develops a text classification system for Chinese documents and proposes an HTF-WDF algorithm for feature selection. Unlike other feature selection algorithms, this method takes term frequency into account. Using the idea of fuzzy features, terms with high term frequency (HTF) are identified and appended to the feature list. Features that represent the topic of the documents are selected according to their weighted document frequencies (WDF), which avoids the problems of the traditional document frequency (DF) method. A Support Vector Machine (SVM) is then used to train the classifier. The proposed algorithm is verified on representative Chinese documents, and the experimental results show that it outperforms the traditional DF algorithm.
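The abstract does not give the HTF-WDF formulas, so the following is only a rough sketch of the two-step idea it describes: rank terms by a weighted document frequency, then append high-term-frequency terms to the feature list. Everything specific here is an assumption for illustration, not the authors' definition: the function name `select_features`, the parameters `k_wdf` and `tf_threshold`, and in particular the weighting (each document contributes the term's relative frequency in that document instead of a flat count of 1).

```python
from collections import Counter


def select_features(docs, k_wdf=3, tf_threshold=3):
    """Return a feature list from tokenized documents (lists of terms).

    ASSUMPTION: the weighting and thresholds below are hypothetical
    stand-ins; the paper's actual WDF and HTF definitions are not
    stated in the abstract.
    """
    wdf = Counter()  # assumed weighted document frequency per term
    tf = Counter()   # corpus-wide term frequency per term

    for tokens in docs:
        if not tokens:
            continue
        counts = Counter(tokens)
        tf.update(counts)
        for term, count in counts.items():
            # Plain DF would add 1 here; this sketch adds the term's
            # relative frequency within the document instead.
            wdf[term] += count / len(tokens)

    # Step 1: keep the k_wdf terms with the highest weighted DF
    # (ties broken alphabetically so the result is deterministic).
    features = sorted(wdf, key=lambda t: (-wdf[t], t))[:k_wdf]

    # Step 2: append high-term-frequency terms not already selected.
    for term in sorted(tf, key=lambda t: (-tf[t], t)):
        if tf[term] >= tf_threshold and term not in features:
            features.append(term)

    return features


if __name__ == "__main__":
    corpus = [["a", "a", "b"], ["a", "c"], ["b", "b", "b", "d"]]
    print(select_features(corpus, k_wdf=1, tf_threshold=4))
```

In a full pipeline the selected terms would define the vector space on which the SVM classifier is trained; that step is omitted here.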