Forgetting Word Segmentation in Chinese Text Classification with L1-Regularized Logistic Regression.

Qiang Fu, Xinyu Dai, Shujian Huang, Jiajun Chen
DOI: https://doi.org/10.1007/978-3-642-37456-2_21
2013-01-01
Abstract: Word segmentation is commonly a preprocessing step for Chinese text representation when building a text classification system. We have found that Chinese text representation based on segmented words may lose some features that are valuable for classification, regardless of whether the segmentation results are correct. To preserve these features, we propose using character-based N-grams to represent Chinese text in a larger-scale feature space. Considering the sparsity of N-gram data, we suggest the L1-regularized logistic regression (L1-LR) model for classifying Chinese text, for better generalization and interpretability. The experimental results demonstrate that our proposed method outperforms state-of-the-art methods. Further qualitative analysis also shows that character-based N-gram representation with L1-LR is reasonable and effective for text classification. © Springer-Verlag 2013.
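To make the idea in the abstract concrete, the following is a minimal pure-Python sketch of character-based N-gram features combined with an L1-penalized logistic regression. It is not the authors' implementation. The function names (char_ngrams, train_l1_lr, predict), the toy sentences, the hyperparameters, and the choice of proximal gradient descent as the solver are all assumptions made for illustration; the abstract does not state which solver or N-gram range was used.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max (no word segmentation)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def train_l1_lr(docs, labels, lam=0.05, lr=0.2, epochs=300):
    """Fit L1-regularized logistic regression by proximal gradient descent.

    lam, lr and epochs are illustrative values, not taken from the paper.
    The bias term is left unpenalized.
    """
    feats = [char_ngrams(d) for d in docs]
    weights, bias, m = {}, 0.0, len(docs)
    for _ in range(epochs):
        grad, grad_bias = {}, 0.0
        for f, y in zip(feats, labels):
            z = bias + sum(weights.get(k, 0.0) * v for k, v in f.items())
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            grad_bias += err / m
            for k, v in f.items():
                grad[k] = grad.get(k, 0.0) + err * v / m
        bias -= lr * grad_bias
        for k, g in grad.items():
            # Gradient step on the smooth loss, then L1 shrinkage.
            weights[k] = soft_threshold(weights.get(k, 0.0) - lr * g, lr * lam)
    return weights, bias


def predict(text, weights, bias):
    """Return the estimated probability that text belongs to class 1."""
    f = char_ngrams(text)
    return sigmoid(bias + sum(weights.get(k, 0.0) * v for k, v in f.items()))


# Toy data (hypothetical): 1 = sports, 0 = finance.
docs = ["足球比赛很精彩", "篮球比赛结束了", "股票市场上涨", "股票价格下跌"]
labels = [1, 1, 0, 0]
weights, bias = train_l1_lr(docs, labels)
```

Note that char_ngrams("中文分类", 1, 2) includes the bigram "文分", which straddles the boundary between the words 中文 and 分类. A segmentation-based representation would not produce such a feature, which is the kind of information the abstract argues can be lost. The soft-thresholding step is what allows weights to be driven to exactly zero, which is the source of the sparsity and interpretability that motivate the L1 penalty.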