Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization

Jingyang Li, Maosong Sun
DOI: https://doi.org/10.1007/978-3-540-70939-8_52
2007-01-01
Abstract: Traditional tfidf-like term weighting schemes use a rough statistic, idf, as the term weighting factor. It does not exploit the category information (category labels on documents) or the intra-document information (the relative importance of a given term to a given document that contains it) available in the training data for a text categorization task. We present here a more elaborate nonparametric probabilistic model that makes use of this information in the term weighting phase. idf is theoretically proved to be a rough approximation of this new term weighting factor. This work is preliminary and aims mainly to inspire further study on exploiting this information, but it already provides a moderate performance boost on three popular document collections.
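
To make the limitation described in the abstract concrete, the following is a minimal sketch of a conventional tf-idf weight. It assumes the common form idf = log(N / df); the exact variant used in the paper is not stated here. The toy documents, the labels, and the function names `idf` and `tfidf` are hypothetical and serve only as illustration. The sketch does not implement the paper's probabilistic model.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency, log(N / df). Category labels are never consulted."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(doc, docs):
    """Raw term frequency times idf for every term in one document."""
    counts = Counter(doc)
    return {term: count * idf(term, docs) for term, count in counts.items()}

# Hypothetical toy corpus: two "sport" documents, two "finance" documents.
docs = [
    ["goal", "match", "goal"],
    ["goal", "report", "coach"],
    ["stock", "report", "market"],
    ["stock", "price", "market"],
]
labels = ["sport", "sport", "finance", "finance"]  # unused by idf, which is the point

# "goal" occurs only in sport documents, "report" is split across both
# categories, yet both have document frequency 2 and so the same idf.
print(idf("goal", docs), idf("report", docs))
print(tfidf(docs[0], docs))
```

In this toy corpus "goal" is clearly more indicative of a category than "report", but idf assigns them the same weight because it depends only on document frequency. This is the kind of category information the abstract says idf leaves unused.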