A Comparative Study on Feature Weight in Text Categorization
Zhi-Hong Deng, Shi-Wei Tang, Dong-Qing Yang, Ming Zhang, Li-Yu Li, Kun-Qing Xie
DOI: https://doi.org/10.1007/978-3-540-24655-8_64
2004-01-01
Abstract: Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weighting, which computes the value of each feature (term) in a document, is one of the important preprocessing techniques in text categorization. This paper is a comparative study of feature weighting methods in statistical learning for text categorization. Four methods were evaluated: tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. We evaluated these methods on the benchmark collection Reuters-21578 with Support Vector Machine (SVM) classifiers. We found that tf*CHI was the most effective in our experiments. Using tf*CHI with an SVM classifier yielded very high classification accuracy (87.5% micro-averaged F1 and 87.8% micro-averaged break-even point). tf*idf, which is widely used in text categorization, compares favorably with tf*CRF but is not as effective as tf*CHI and tf*OddsRatio.
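
To make the contrast between an unsupervised weight (tf*idf) and a class-aware weight (tf*CHI) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common textbook forms: idf as log(N/df) and the chi-square statistic computed from a 2x2 term/category contingency table. The abstract does not give the paper's exact formulas, normalization, or smoothing, so those details, along with the toy corpus and the function names `chi_square`, `idf`, and `weight_document`, are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokenized documents and one label per document.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export"],
    ["bank", "rate", "rise"],
    ["bank", "loan"],
]
labels = ["grain", "grain", "money", "money"]


def chi_square(term, category, docs, labels):
    """Chi-square statistic between a term and a category (2x2 table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # in category, contains term
        elif has_term:
            b += 1  # other category, contains term
        elif in_cat:
            c += 1  # in category, lacks term
        else:
            d += 1  # other category, lacks term
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return len(docs) * (a * d - c * b) ** 2 / denom


def idf(term, docs):
    """Inverse document frequency, log(N / df); 0.0 for unseen terms."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def weight_document(tokens, docs, labels, category, scheme):
    """Weight each term of one document as tf times idf or chi-square."""
    tf = Counter(tokens)
    if scheme == "tfidf":
        return {t: f * idf(t, docs) for t, f in tf.items()}
    if scheme == "tfchi":
        return {t: f * chi_square(t, category, docs, labels)
                for t, f in tf.items()}
    raise ValueError(f"unknown scheme: {scheme}")


tfidf_weights = weight_document(docs[0], docs, labels, "grain", "tfidf")
tfchi_weights = weight_document(docs[0], docs, labels, "grain", "tfchi")
```

In this toy corpus, "wheat" and "rise" each occur in two of the four documents, so tf*idf gives them the same weight. tf*CHI separates them: "wheat" appears only in the "grain" documents, while "rise" is split evenly across both categories and so carries no class information. In a full pipeline, weighted vectors of this kind would be the input to the SVM classifier.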