Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations.

Longkai Zhang,Li,Zhengyan He,Houfeng Wang,Ni Sun
2013-01-01
Abstract:Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctuation information of unlabeled micro-blog data by introducing characters behind or ahead of punctuations, for they indicate the beginning or end of words. Meanwhile a self-training framework to incorporate confident instances is also used, which prove to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV-recall.
What problem does this paper attempt to address?