Adapting Conventional Chinese Word Segmenter for Segmenting Micro-blog Text: Combining Rule-based and Statistic-based Approaches.

Ning Xi,Bin Li,Guangchao Tang,Shujian Huang,Yinggong Zhao,Hao Zhou,Xinyu Dai,Jiajun Chen
2012-01-01
Abstract:We describe two adaptation strategies which are used in our word segmentation system in participating the Microblog word segmentation bake-off: Domain invariant information is extracted from the in-domain unlabelled corpus, and is incorporated as supplementary features to conventional word segmenter based on Conditional Random Field (CRF), we call it statistic-based adaptation. Some heuristic rules are further used to post-process the word segmentation result in order to better handle the characters in emoticons, name entities and special punctuation patterns which extensively exist in micro-blog text, and we call it rule-based adaptation. Experimentally, using both adaptation strategies, our system achieved 92.46 points of F-score, compared with 88.73 points of F-score of the unadapted CRF word segmenter on the pre-released development data. Our system achieved 92.51 points of F-score on the final test data.
What problem does this paper attempt to address?