A Joint Model for Unsupervised Chinese Word Segmentation.

Miaohong Chen, Baobao Chang, Wenzhe Pei
DOI: https://doi.org/10.3115/v1/d14-1092
2014-01-01
Abstract: In this paper, we propose a joint model for unsupervised Chinese word segmentation (CWS). Inspired by the “products of experts” idea, our joint model first combines two generative models, a word-based hierarchical Dirichlet process (HDP) model and a character-based hidden Markov model (HMM), by simply multiplying their probabilities together. Gibbs sampling is used for model inference. To further incorporate the strengths of goodness-based models, we then integrate nVBE into our joint model by using it to initialize the Gibbs sampler. We conduct experiments on the PKU and MSRA datasets provided by the second SIGHAN bakeoff. Test results on these two datasets show that the joint model achieves much better results than all of its component models. Statistical significance tests also show that it is significantly better than state-of-the-art systems, achieving the highest F-scores. Finally, analysis indicates that, compared with nVBE and HDP, the joint model has a stronger ability to resolve both combinational and overlapping ambiguities in Chinese word segmentation.
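The "multiplying their probabilities together" step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `product_of_experts`, the two-outcome boundary decision, and all probability values are hypothetical, chosen only to show how two models' distributions over the same outcomes might be multiplied and renormalized before a Gibbs-style sampling step.

```python
import random


def product_of_experts(p_a, p_b):
    """Multiply two distributions over the same outcomes and renormalize."""
    unnormalized = {k: p_a[k] * p_b[k] for k in p_a}
    z = sum(unnormalized.values())
    return {k: v / z for k, v in unnormalized.items()}


# Hypothetical probabilities for one candidate word boundary.
# In the paper these would come from the word-based HDP model and the
# character-based HMM; the numbers here are made up for illustration.
p_word_model = {"boundary": 0.6, "no_boundary": 0.4}
p_char_model = {"boundary": 0.7, "no_boundary": 0.3}

joint = product_of_experts(p_word_model, p_char_model)

# One Gibbs-style step: sample the boundary decision from the joint distribution.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
decision = "boundary" if rng.random() < joint["boundary"] else "no_boundary"
```

Because the product is renormalized, an outcome that either model considers unlikely receives low joint probability, which is the intuition behind combining experts by multiplication rather than averaging.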