Research on Improved Corpus-Level and Phrase-Level Pivot Language Based Methods in Low-Resource Machine Translation
Qiang LI, Qiang WANG, Tong XIAO, Jing-Bo ZHU
DOI: https://doi.org/10.11897/SP.J.1016.2017.00925
2017-01-01
Abstract: In this paper, we use English as the pivot language to build statistical machine translation systems, because parallel training corpora between the foreign languages and Chinese do not exist. We classify pivot language based methods into system-level, corpus-level, and phrase-level methods. For the proposed improved corpus-level method, we improve translation performance by enlarging the bilingual training corpora and improving the quality of word alignments. For the typical phrase-level pivot language based method, many high-quality phrase pairs cannot be generated from the source-pivot and pivot-target phrase translation tables, so we use a decoding-generation method to enlarge the set of phrase pairs in the phrase translation table and improve translation performance. We analyze the strengths and weaknesses of the system-level, corpus-level, and phrase-level pivot language based approaches during system construction, and through human analysis we find that no single method achieves the best translation performance on all translation tasks. We therefore propose a corpus-phrase combination based pivot method, which achieves the highest BLEU scores on all the translation tasks. We translate Bengali, Tamil, Uzbek, and Hungarian into Chinese with the proposed pivot language based methods. Finally, we observe significant improvements of 0.8 to 2.8 BLEU points over the baseline translation system on the test datasets when translating Bengali, Tamil, Uzbek, and Hungarian.
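As an illustration of the typical phrase-level pivot method mentioned in the abstract, the sketch below shows standard phrase-table triangulation, in which a source-target probability is obtained by summing over shared pivot phrases: p(t|s) = sum over pivot phrases v of p(v|s) * p(t|v). It is a minimal sketch of the general technique, not the authors' implementation, and it does not cover their decoding-generation extension. The function name `triangulate`, the nested-dictionary table layout, and the placeholder phrases and probabilities are all assumptions made for illustration.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Induce p(tgt | src) by marginalising over shared pivot phrases.

    Both tables are assumed (hypothetical layout) to be nested dicts:
    src_pivot[src][pivot] = p(pivot | src)
    pivot_tgt[pivot][tgt] = p(tgt | pivot)
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # A pivot phrase missing from the pivot-target table contributes nothing.
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                src_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_tgt.items()}


# Toy tables with placeholder phrases and made-up probabilities.
src_pivot = {
    "src_a": {"house": 0.5, "home": 0.5},
    "src_b": {"cottage": 1.0},
}
pivot_tgt = {
    "house": {"tgt_x": 0.5, "tgt_y": 0.5},
    "home": {"tgt_x": 1.0},
}

table = triangulate(src_pivot, pivot_tgt)
print(table)
```

In this toy example `src_b` receives no target phrase at all, because its only pivot phrase, `cottage`, is absent from the pivot-target table. This is the kind of coverage gap that motivates enlarging the induced phrase table.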