Exploring Multiple Chinese Word Segmentation Results Based on Linear Model

Chen Su,Yujie Zhang,Zhen Guo,Jinan Xu
DOI: https://doi.org/10.1007/978-3-642-41644-6_6
2013-01-01
Abstract:In the process of developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on large amounts of training text often decreases because of unknown words. The lack of domain-specific annotated corpus makes supervised learning approaches unable to adapt to a target domain. This problem results in many errors in translation knowledge extraction and therefore seriously lowers translation quality. To solve the domain adaptation problem, we implement Chinese word segmentation by exploring n-gram statistical features in large Chinese raw corpus and bilingually motivated Chinese word segmentation, respectively. Moreover, we propose a method of combining multiple Chinese word segmentation results based on linear model to augment domain adaptation. For evaluation, we conduct experiments of Chinese word segmentation and Chinese-English machine translation using the data of NTCIR-10 Chinese-English patent task. The experimental results showed that the proposed method achieves improvements in both F-measure of the Chinese word segmentation and BLEU score of the Chinese-English statistical machine translation system.
What problem does this paper attempt to address?