A Graph-based Bilingual Corpus Selection Approach for SMT.

Wen-Han Chao,Zhoujun Li
2011-01-01
Abstract:In statistical machine translation, the number of sentence pairs in the bilingual corpus is very important to the quality of translation. However, when the quantity reaches some extent, enlarging the corpus has less effect on the translation quality; whereas increasing greatly the time and space complexity to train the translation model, which hinders the development of statistical machine translation. In this paper, we propose a graph-based bilingual corpus selection approach, which makes use of the structural information of corpus to measure and update the importance of each sentence pair, and then selects a sentence pair with the highest importance each time. Our experiments in a Chinese-English translation task show that, selecting only 50% of the whole corpus by the graph-based selection approach as training set, we can obtain the near translation result with the one using the whole corpus.
What problem does this paper attempt to address?