Abstract: Text classification, which automatically assigns the correct category label to an input text document, has received great attention as text documents continue to accumulate. Feature selection is an important step in text classification; its goal is to choose highly discriminative features that improve classifier performance. This paper studies filter-based feature selection methods, which rank features by a feature selection metric and select features according to the ranking. Traditional document frequency (DF) is a common statistics-based feature selection metric. It uses the number of documents containing a feature as the basis for selection and treats a feature as important if it appears in most documents. However, this tends to select features that carry little category information, because it ignores the correlation between features and categories. We propose an improved feature ranking metric, called normalized document frequency (NDF), which takes into account the correlation between features and categories and introduces two normalization factors: the number of documents based on category and the number of documents based on feature. The performance of NDF is compared with three well-known feature ranking metrics, DF, odds ratio (OR), and chi-squared (CHI), on a news data set at different feature dimensions using a naive Bayes (NB) classifier. The results show that NDF outperforms the three metrics in terms of macro-F1 and increases accuracy (ACC) by 3.5%, 2.0%, and 3.4%, respectively. NDF therefore selects more valuable features for distinguishing text categories and effectively improves text classification performance.
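As a rough illustration of the DF baseline described in the abstract, the following Python sketch counts how many documents contain each term and keeps the top-ranked terms. The function names, the toy documents, and the tie-breaking rule are assumptions made for this example; the NDF formula itself is not given in the abstract, so it is not reproduced here.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # Use a set so a term is counted at most once per document.
        df.update(set(tokens))
    return df


def select_top_k(df, k):
    """Rank terms by descending document frequency and keep the first k.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "market", "fell", "sharply"],
    ["the", "team", "won", "the", "match"],
]

df = document_frequency(docs)
top_terms = select_top_k(df, 2)
print(top_terms)
```

In this toy corpus the highest-ranked term is "the", which occurs in every document regardless of topic. This is the weakness the abstract attributes to plain DF: a frequent term need not carry category information.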

Feature Selection Based on Absolute Deviation Factor for Text Classification

CLDA: Feature Selection for Text Categorization Based on Constrained LDA

Feature Selection Method on Imbalanced Text

A New Feature Selection Method for Text Classification Based on Independent Feature Space Search

New Feature Selection Approach (cdf) for Text Categorization

Aggressive Dimensionality Reduction With Reinforcement Local Feature Selection For Text Categorization

Relative Term-Frequency Based Feature Selection for Text Categorization

Text Classification Method Based on Normalized Document Frequency Feature Selection

Feature Selection Method Based on Improved Document Frequency

Efficient Method for Feature Selection in Text Classification

De-redundancy Relative Discrimination Criterion-based Feature Selection for Text Data

Feature selection based on a normalized difference measure for text classification

A General Framework of Feature Selection for Text Categorization

Learning Effective Features for Chinese Text Categorization

Document Feature Selection Based on the Minimum Term Frequency Threshold

Inter-Category Distribution Enhanced Feature Extraction for Efficient Text Classification

Feature selection method based on category discriminability

Improved Document Feature Selection with Categorical Parameter for Text Classification

An Improved Chinese Text Classification Algorithm Based On Multiple Feature Factors

A New Feature Selection Based on Comprehensive Measurement Both in Inter-Category and Intra-Category for Text Categorization

Feature reduction methods for text classification