Leveraging Rich Linguistic Features for Cross-domain Chinese Segmentation.

Guohua Wu, Dezhu He, Keli Zhong, Xue Zhou, Caixia Yuan
DOI: https://doi.org/10.3115/v1/w14-6816
2014-01-01
Abstract: This paper describes the system we used for the Chinese segmentation task in the 3rd CIPS-SIGHAN bakeoff. We treat segmentation as character sequence labeling and, to improve accuracy across multiple domains, present a CRF-based Chinese segmentation system that integrates supervised, unsupervised, and lexical features. We first produce a preliminary segmentation of the target data using a CRF model trained on these three types of features; new words are detected in the result and added to the lexicon. To generalize across domains, we then run a second segmentation pass with the updated lexicon. OOV recognition is further improved by refined post-processing. All features share a unified feature template for CRF training. Our system achieves a competitive F-score of 0.9730 in this bakeoff.
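The character-labeling and lexicon-update steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/M/E/S tag set, the helper names (`words_to_tags`, `tags_to_words`, `lexicon_feature`, `update_lexicon`), and the `max_len` and `min_count` thresholds are all assumptions made for illustration, and the CRF model itself is omitted.

```python
# Hypothetical sketch: the tag set and every helper name here are assumptions,
# not details taken from the paper.

def words_to_tags(words):
    """Map a segmented sentence to one B/M/E/S tag per character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def lexicon_feature(chars, i, lexicon, max_len=4):
    """Length of the longest lexicon word starting at position i (0 if none)."""
    for n in range(min(max_len, len(chars) - i), 0, -1):
        if "".join(chars[i:i + n]) in lexicon:
            return n
    return 0


def update_lexicon(lexicon, first_pass_words, min_count=2):
    """Add multi-character words seen at least min_count times in the first pass."""
    counts = {}
    for w in first_pass_words:
        if len(w) > 1 and w not in lexicon:
            counts[w] = counts.get(w, 0) + 1
    return lexicon | {w for w, c in counts.items() if c >= min_count}
```

In a two-pass setup of this kind, `lexicon_feature` would be recomputed with the output of `update_lexicon` before the second segmentation pass, so that words found in the target domain can influence the labels.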