A comparative study on text representation schemes in text categorization

Fengxi Song, Shuhai Liu, Jingyu Yang
DOI: https://doi.org/10.1007/s10044-005-0256-3
IF: 2.307
2005-01-01
Pattern Analysis and Applications
Abstract: It is well known that the classification effectiveness of a text categorization system is not simply a matter of learning algorithms; text representation factors are also at work. This paper considers how the effectiveness of text classifiers is linked to five text representation factors: “stop words removal”, “word stemming”, “indexing”, “weighting”, and “normalization”. Statistical analyses of experimental results show that performing “normalization” always improves the effectiveness of text classifiers significantly. The effects of the other factors are not as great as expected. Contrary to common sense, a simple binary indexing method can sometimes be helpful for text categorization.
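As a rough illustration of the five factors named in the abstract, the sketch below toggles each one independently when turning raw documents into term vectors. It is not the paper's experimental setup: the stop list, the crude suffix-stripping stemmer, the log(N/df) inverse document frequency weighting, and the L2 (cosine) normalization are all assumptions chosen for brevity, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter

# Toy stop list (assumption; not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def represent(docs, remove_stop=True, stem=True, binary=False,
              use_idf=True, normalize=True):
    """Return one sparse {term: value} vector per document.

    Each keyword argument switches one representation factor on or off:
    stop words removal, word stemming, indexing (binary vs. term count),
    weighting (assumed here to be idf = log(N / df)), and normalization
    (assumed here to be L2).
    """
    tokenized = []
    for doc in docs:
        tokens = doc.lower().split()
        if remove_stop:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        if stem:
            tokens = [naive_stem(t) for t in tokens]
        tokenized.append(tokens)

    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, count in Counter(tokens).items():
            value = 1.0 if binary else float(count)
            if use_idf:
                value *= math.log(n_docs / df[term])
            vec[term] = value
        if normalize:
            norm = math.sqrt(sum(v * v for v in vec.values()))
            if norm > 0:
                vec = {t: v / norm for t, v in vec.items()}
        vectors.append(vec)
    return vectors
```

Calling `represent(docs, binary=True, use_idf=False)` gives the simple binary indexing scheme with normalization, while the defaults give a normalized count-times-idf scheme; comparing classifier accuracy across such settings is the kind of experiment the abstract describes.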