A new term weighting method for text categorization

LAN Man
2007-01-01
Name: Man Lan
Degree: Doctor of Philosophy
Department: Department of Computer Science
Thesis Title: A New Term Weighting Method for Text Categorization

Abstract: Text representation is the task of transforming a textual document into a compact representation of its content so that the document can be recognized and classified by a computer or a classifier. This thesis focuses on developing an effective and efficient term weighting method for the text categorization task. We selected the single token as the feature unit because previous research showed that this simple type of feature outperforms more complicated types. We investigated several widely used unsupervised and supervised term weighting methods on several popular data collections in combination with the SVM and kNN algorithms. Considering the distribution of relevant documents in the collection and analyzing each term's discriminating power, we propose a new term weighting scheme, tf.rf. The controlled experiments showed that term weighting methods perform inconsistently across data sets with different category distributions and across different learning algorithms. Most of the supervised term weighting methods based on information theory did not perform satisfactorily in our experiments. However, the newly proposed tf.rf method performs consistently better than the other term weighting methods. In contrast, the popular tf.idf method did not perform uniformly well across data sets with different category distributions.
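The abstract names tf.rf and tf.idf but does not define them. The following is a minimal sketch of how the two weights might be compared on a toy corpus. It assumes the relevance frequency formulation associated with the author's later published work, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term. That formula, the helper names (`idf`, `rf`, `doc_counts`), and the toy data are all illustrative assumptions, not taken from this abstract.

```python
import math

def idf(n_docs: int, df: int) -> float:
    """Unsupervised inverse document frequency: log2(N / df)."""
    return math.log2(n_docs / df)

def rf(a: int, c: int) -> float:
    """Assumed relevance frequency: log2(2 + a / max(1, c)).

    a: positive-category documents containing the term
    c: negative-category documents containing the term
    """
    return math.log2(2 + a / max(1, c))

def doc_counts(term, docs, labels, category):
    """Count documents containing `term` inside and outside `category`."""
    a = sum(1 for d, y in zip(docs, labels) if y == category and term in d)
    c = sum(1 for d, y in zip(docs, labels) if y != category and term in d)
    return a, c

# Hypothetical toy corpus: each document is a set of single-token features.
docs = [{"wheat", "price"}, {"wheat", "farm"}, {"price", "oil"}, {"oil", "barrel"}]
labels = ["grain", "grain", "crude", "crude"]

tf = 3  # assumed raw term frequency of "wheat" in the document being weighted
a, c = doc_counts("wheat", docs, labels, "grain")

weight_rf = tf * rf(a, c)                 # supervised: uses category labels
weight_idf = tf * idf(len(docs), a + c)   # unsupervised: ignores category labels
print(weight_rf, weight_idf)
```

In this sketch, idf depends only on how many documents contain the term, while rf grows when the term is concentrated in the positive category, which is one way to read the abstract's reference to the distribution of relevant documents and the term's discriminating power.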