Chinese Text Classification Without Word Segmentation

XU Yun, FAN Xiao-zhong, ZHANG Feng
DOI: https://doi.org/10.3969/j.issn.1001-0645.2005.09.007
2005-01-01
Abstract: This paper proposes an approach to Chinese text classification without word segmentation, based on n-gram language modeling. Unlike traditional text classification models, the character-level n-gram approach avoids word segmentation and explicit feature selection, procedures that tend to lose a significant amount of useful information. It also greatly reduces data sparsity, because a vocabulary made up of characters is smaller than one formed from words. A systematic study examines the key factors in language modeling and their influence on classification; in experiments on the Chinese TREC corpus, the reported evaluation index reached 86.8%.
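To make the idea in the abstract concrete, here is a minimal sketch of a character-level n-gram classifier: one language model is trained per class, and a document is assigned to the class whose model gives it the highest log-probability. This is an illustration under stated assumptions, not the authors' implementation. The class name `CharNgramClassifier`, the add-alpha smoothing, the start-of-text padding symbol, and the toy training sentences are all hypothetical choices; the abstract does not say which smoothing method or n-gram order the paper uses.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Per-class character n-gram language models (illustrative sketch)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n            # n-gram order (assumed; not given in the abstract)
        self.alpha = alpha    # add-alpha smoothing constant (assumed)
        self.ngram_counts = defaultdict(Counter)    # label -> (context, char) counts
        self.context_counts = defaultdict(Counter)  # label -> context counts
        self.doc_counts = Counter()                 # label -> number of training docs
        self.vocab = set()                          # all characters seen in training

    def _ngrams(self, text):
        # Pad with a start symbol so the first characters also have a context.
        padded = "\x02" * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, texts, labels):
        # No word segmentation: the raw character sequence is the only input.
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for context, char in self._ngrams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
        return self

    def log_score(self, text, label):
        # log P(label) + sum of log P(char | context, label), add-alpha smoothed.
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for context, char in self._ngrams(text):
            numerator = self.ngram_counts[label][(context, char)] + self.alpha
            denominator = self.context_counts[label][context] + self.alpha * vocab_size
            score += math.log(numerator / denominator)
        return score

    def predict(self, text):
        # Choose the class whose language model scores the text highest.
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Toy data, invented purely for illustration.
train_texts = ["足球比赛很精彩", "篮球比赛开始了", "股票市场上涨", "银行利率下调"]
train_labels = ["sports", "sports", "finance", "finance"]

clf = CharNgramClassifier(n=2).fit(train_texts, train_labels)
sports_guess = clf.predict("比赛结果")
finance_guess = clf.predict("股票上涨")
```

Because the model works directly on characters, there is no segmentation step and no separate feature-selection pass; every character n-gram contributes to the score. A real system would likely need a stronger smoothing scheme and far more training data than this toy example.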