Document Feature Selection Based on the Minimum Term Frequency Threshold

Xiao-Yun CHEN, Rong-Lu LI, Yun-Fa HU
DOI: https://doi.org/10.3969/j.issn.1003-6059.2006.04.018
2006-01-01
Pattern Recognition and Artificial Intelligence
Abstract: This paper presents a novel feature evaluation function based on document frequency with a minimum term frequency threshold (DF_n). To reduce the influence of unrelated features on a text categorization system, the properties of unrelated features are analyzed, and their term frequency is found to be commonly low. Applying a minimum term frequency threshold to filter out low-frequency features therefore markedly reduces the number of unrelated features. The experimental results confirm that the proposed method greatly reduces the number of unrelated features and effectively improves the accuracy of text categorization. The improvement to Mutual Information (MI) is particularly marked. The macro-averaged F1 value based on DF_n is 40% higher than that of Term Frequency, and 15% to 30% higher than that of Document Frequency (DF).
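The abstract does not give the exact definition of DF_n, so the following is only a minimal sketch of one plausible reading: a document counts toward a term's document frequency only if the term occurs in it at least n times. The function names `df_n` and `select_features`, the default threshold `n=2`, and the cutoff `k` are hypothetical choices for illustration, not taken from the paper.

```python
from collections import Counter


def df_n(tokenized_docs, n=2):
    """Assumed DF_n: for each term, the number of documents in which
    the term appears at least n times (n = minimum term frequency)."""
    counts = Counter()
    for tokens in tokenized_docs:
        tf = Counter(tokens)  # term frequency within this document
        # A document contributes to a term's count only if tf >= n.
        counts.update(term for term, freq in tf.items() if freq >= n)
    return counts


def select_features(tokenized_docs, n=2, k=1000):
    """Keep the k terms with the highest assumed DF_n score."""
    return [term for term, _ in df_n(tokenized_docs, n).most_common(k)]
```

With n = 1 this reduces to ordinary document frequency, so the threshold can be read as a filter layered on top of DF: terms that never reach n occurrences in any single document receive a score of zero and drop out of the candidate set.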