A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information.
Xu Sun, Yao-zhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, Jun'ichi Tsujii
DOI: https://doi.org/10.3115/1620754.1620763
2009-01-01
Abstract: Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative, a latent variable model, which uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long-range dependencies and improve recall when segmenting long words, such as named entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best ones in the literature.
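To make the "character-based tagging" framing concrete, the following is a minimal sketch of how a segmented sentence can be encoded as one tag per character and decoded back into words. The two-tag B/I scheme (B for the first character of a word, I for the rest), the function names, and the example sentence are assumptions for illustration; the abstract does not state which tag set the paper uses.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one tag per character.

    Assumed scheme: "B" marks the first character of a word,
    "I" marks every following character of the same word.
    """
    tags = []
    for word in words:
        tags.append("B")
        tags.extend("I" * (len(word) - 1))
    return tags


def tags_to_words(chars, tags):
    """Decode a character sequence and its tags back into words."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            # Start a new word (also guards against a leading "I").
            words.append(ch)
        else:
            # Continue the current word.
            words[-1] += ch
    return words


# Hypothetical example sentence, pre-segmented into three words.
segmented = ["我们", "喜欢", "北京大学"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Under this encoding a long word such as a four-character named entity becomes one B followed by three I tags, so recovering it requires every one of those per-character decisions to be correct. This is one way to see why long words are hard for a purely character-level tagger.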