Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation.

Longkai Zhang, Houfeng Wang, Xu Sun, Mairgup Mansur
DOI: https://doi.org/10.18653/v1/d13-1031
2013-01-01
Abstract: Supervised sequence labeling models can now reach competitive performance on the task of Chinese word segmentation. However, the ability of these models is restricted by the availability of annotated data and the design of features. We propose a scalable semi-supervised feature engineering approach. In contrast to previous work that uses pre-defined task-specific features with fixed values, we dynamically extract representations of label distributions from both an in-domain corpus and an out-of-domain corpus. We update the representation values with a semi-supervised approach. Experiments on the benchmark datasets show that our approach achieves good results and reaches an F-score of 0.961. The feature engineering approach proposed here is a general iterative semi-supervised method and is not limited to the word segmentation task.
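For readers unfamiliar with how a segmentation F-score such as the 0.961 above is typically computed, the following is a minimal sketch of the standard word-level precision, recall, and F1 calculation. It is not the authors' evaluation script; the function names `word_spans` and `segmentation_prf` and the example sentence are hypothetical and chosen only for illustration. A predicted word counts as correct only when both of its boundaries match a gold word.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        end = start + len(w)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and F1 for one segmented sentence.

    Both inputs must segment the same underlying character string.
    """
    gold = word_spans(gold_words)
    pred = word_spans(pred_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits the second word.
gold = ["我们", "喜欢", "北京"]
pred = ["我们", "喜", "欢", "北京"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this example two of the four predicted words match the three gold words, so precision is 2/4, recall is 2/3, and F1 is 4/7. Benchmark scores are normally computed over the whole test set by summing the counts across sentences before dividing, rather than by averaging per-sentence scores.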