Segmenting Long Sentence Pairs for Statistical Machine Translation

Biping Meng, Shujian Huang, Xinyu Dai, Jiajun Chen
DOI: https://doi.org/10.1109/ialp.2009.20
2009-01-01
Abstract: In phrase-based statistical machine translation, knowledge about phrase translation and phrase reordering is learned from bilingual corpora. In practice, however, words in long sentence pairs are often poorly aligned, which harms subsequent steps of the translation pipeline such as phrase extraction. One possible solution is to segment long sentence pairs into shorter ones. In this paper, we present an effective approach to segmenting sentence pairs based on a modified IBM Translation Model 1. We find that taking into account the semantics of some words, as well as the length ratio of the source and target sentences, substantially improves the segmentation result. We also discuss the effect of the length factor on the segmentation result. Experiments show that our approach improves the BLEU score of a phrase-based translation system by about 0.5 points.
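To make the idea of Model 1 based segmentation concrete, the following is a minimal sketch, not the authors' implementation. It scores every monotone split point of a sentence pair with the standard IBM Model 1 likelihood of each half, plus a simple length-ratio penalty. The abstract does not specify how Model 1 is modified, so the penalty form, the function names, the `ratio` and `weight` parameters, the probability floor, and the toy translation table are all assumptions made for illustration.

```python
import math

def model1_logprob(src, tgt, t, floor=1e-6):
    """IBM Model 1 log P(tgt | src), ignoring the constant epsilon term.

    t maps (target_word, source_word) -> lexical translation probability.
    Unseen pairs get a small floor probability (an assumption of this sketch).
    """
    src_null = ["<NULL>"] + list(src)  # Model 1 adds an empty source word
    logp = 0.0
    for f in tgt:
        total = sum(t.get((f, e), floor) for e in src_null)
        logp += math.log(total / len(src_null))
    return logp

def length_penalty(src_len, tgt_len, ratio, weight):
    """Hypothetical penalty for segments whose lengths depart from the
    expected target/source length ratio."""
    return -weight * abs(tgt_len - ratio * src_len)

def best_split(src, tgt, t, ratio=1.0, weight=0.5):
    """Return (score, i, j): split src before index i and tgt before index j."""
    best = None
    for i in range(1, len(src)):
        for j in range(1, len(tgt)):
            score = (
                model1_logprob(src[:i], tgt[:j], t)
                + model1_logprob(src[i:], tgt[j:], t)
                + length_penalty(i, j, ratio, weight)
                + length_penalty(len(src) - i, len(tgt) - j, ratio, weight)
            )
            if best is None or score > best[0]:
                best = (score, i, j)
    return best

# Toy translation table: each target word translates one source word.
toy_t = {("A", "a"): 0.9, ("B", "b"): 0.9, ("C", "c"): 0.9, ("D", "d"): 0.9}
score, i, j = best_split(["a", "b", "c", "d"], ["A", "B", "C", "D"], toy_t)
print(i, j)  # picks the split (2, 2) on this toy table
```

In this sketch a split is only attractive when the words on each side translate each other and the two halves have plausible lengths. Applying it recursively to each half would yield shorter sentence pairs for the word aligner.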