Text Classification Method Based on Normalized Document Frequency Feature Selection

ZHAO Hongshan,FAN Guisheng,YU Huiqun
DOI: https://doi.org/10.14135/j.cnki.1006-3080.20180914005
2019-01-01
Abstract: As text documents continue to accumulate, text classification, which automatically assigns the correct category label to an input document, has received great attention. Feature selection is an important step in text classification; its goal is to choose highly discriminative features that improve classifier performance. This paper studies filter-based feature selection methods, which rank features by a feature selection metric and select features according to the ranking. Traditional document frequency (DF) is a common statistics-based feature selection metric. It uses the number of documents containing a feature as the basis for selection and treats a feature as important if it appears in most documents. However, this tends to select features that carry little category information, because it ignores the correlation between features and categories. We propose an improved feature ranking metric, normalized document frequency (NDF), which takes the correlation between features and categories into account and introduces two normalization factors: the number of documents based on category and the number of documents based on feature. The performance of NDF is compared against three well-known feature ranking metrics, DF, odds ratio (OR) and chi-squared (CHI), on a news data set at different feature dimensions using a naive Bayes (NB) classifier. The results show that NDF outperforms the three metrics in terms of macro-F1, and accuracy (ACC) is increased by 3.5%, 2.0% and 3.4%. NDF can therefore better select valuable features that are more useful for distinguishing text categories, and effectively improves text classification performance.
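To make the DF baseline concrete, the following is a minimal sketch of plain document-frequency ranking as the abstract describes it: count the documents that contain each feature and keep the most frequent ones. It does not implement NDF, whose formula is not given in the abstract. The toy corpus and the function names `document_frequency` and `select_top_k` are assumptions made for illustration only.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so each document counts a term once
    return df


def select_top_k(df, k):
    """Return the k terms with the highest DF (ties broken alphabetically)."""
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: (tokens, category) pairs invented for illustration.
corpus = [
    (["the", "match", "ended", "in", "a", "draw"], "sports"),
    (["the", "team", "won", "the", "match"], "sports"),
    (["the", "market", "fell", "sharply"], "finance"),
    (["the", "bank", "raised", "rates"], "finance"),
]

# Plain DF ignores the category labels entirely.
df = document_frequency(tokens for tokens, _ in corpus)
top = select_top_k(df, 2)
print(top)
```

In this toy corpus the top-ranked term is "the", which occurs in every document of both categories and so carries no category information. This is the weakness of plain DF that the abstract points out.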