Study on the Influences of Text Categorization Performance Based on Corpus Information Measurement

Xiangdong Li, Zhichao Ba, Li Huang
DOI: https://doi.org/10.3969/j.issn.1002-1965.2014.09.028
IF: 3.1759
2014-01-01
Journal of Intelligence
Abstract: Categorization performance usually varies across corpora and across categorization algorithms. This article proposes a new method to improve categorization performance, based on an analysis of the underlying reasons why categorization results differ between a specialized corpus and a self-built corpus. It measures corpus information by comparing automatic categorization performance on different corpora and defining three indexes: category clustering density, category complexity, and category definition. It then examines the relationship between the three indexes and categorization performance using multi-factor analysis of variance, to determine how each index affects the performance of different algorithms, and proposes an overlap text categorization method based on category definition to verify the validity of the index. The experiments show that all three indexes affect the categorization performance of different algorithms to some extent: the higher the clustering density, the lower the complexity, and the higher the category definition, the better the categorization performance.
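The abstract does not give formulas for the three indexes. As a rough illustration only, the sketch below assumes one plausible reading of "category clustering density": the mean cosine similarity between each document in a category and that category's term-frequency centroid. The function names `cosine` and `clustering_density` and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def clustering_density(docs):
    """ASSUMED proxy for category clustering density (not the paper's formula):
    average cosine similarity of each document to the category centroid."""
    if not docs:
        return 0.0
    vectors = [Counter(d.lower().split()) for d in docs]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)  # summed counts; cosine ignores overall scale
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)


# Toy categories: one topically tight, one topically scattered.
tight = ["stock market price", "stock price index", "market price index"]
loose = ["stock market price", "football match score", "rice harvest season"]

print(clustering_density(tight) > clustering_density(loose))
```

Under this assumed proxy, a category whose documents share vocabulary scores higher than one whose documents do not, which matches the direction of the reported finding that higher clustering density goes with better categorization performance.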