A Redundancy Based Term Weighting Approach for Text Categorization
Zhen-Yu Lu, Yong-Min Lin, Shuang Zhao, Jing-Nian Chen, Wei-Dong Zhu
DOI: https://doi.org/10.1109/WCSE.2009.191
2009-01-01
Abstract: With the rapid development of the World Wide Web, text categorization has come to play an important role in organizing and processing large amounts of text data. TF•IDF is a simple, fast term weighting method that is widely used in text categorization. Its drawback is that a large weight may be assigned to rarely occurring terms regardless of their posterior distribution. This paper presents a redundancy-based term weighting method that addresses this problem by taking the posterior probability distribution into consideration. Experiments on Reuters-21578 and on a Chinese corpus provided by the Computer and Information Technology Data Center of Fudan University show that this weighting method performs better than TF•IDF.
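The drawback described in the abstract can be illustrated with a minimal sketch of plain TF•IDF. The abstract does not state which TF•IDF variant the authors use, so the form tf × log(N / df) is an assumption here, and the function name `tf_idf`, the toy corpus, and the labels are all hypothetical. The redundancy-based weighting itself is not defined in the abstract and is not sketched.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = tf * log(N / df). Class labels are never
    consulted, which is the limitation the abstract points at.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus: two "energy" and two "grain" documents.
docs = [
    ["oil", "price", "rise"],
    ["oil", "price", "fall", "zzz"],
    ["wheat", "crop", "price"],
    ["wheat", "crop", "yield"],
]
labels = ["energy", "energy", "grain", "grain"]  # unused by tf_idf

w = tf_idf(docs)
for term in ("zzz", "oil", "price"):
    print(term, round(w[1][term], 3))
```

In this toy corpus the one-off token "zzz" receives a larger weight than "oil", even though "oil" occurs in every "energy" document and in no "grain" document. Because the labels never enter the computation, the weight cannot reflect how a term is distributed across categories.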