Sub-word Embedding Auxiliary Encoding in Mongolian-Chinese Neural Machine Translation

Tiangang Bai, Hongxu Hou, Yatu Ji
DOI: https://doi.org/10.1145/3384544.3384565
2020-01-01
Abstract: For low-resource Mongolian-Chinese neural machine translation (NMT), common pre-processing methods such as byte pair encoding (BPE) and tokenization cannot recognize Mongolian special characters, which leads to the loss of sentence information. In addition, the translation quality of low-frequency words is poor because of data sparsity. In this paper, we first propose a processing method for Mongolian special characters that transforms them into an explicit form to reduce pre-processing errors. Second, drawing on the morphological knowledge of Mongolian, we generate sub-word embeddings from a large-scale monolingual corpus to enrich the contextual information in the representations of low-frequency words. The experiments show that 1) Mongolian special character processing can minimize semantic loss, 2) systems with sub-word embeddings from a large-scale monolingual corpus capture the semantic information of low-frequency words effectively, and 3) the proposed approaches improve on the baselines by 1-2 BLEU points.
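The abstract does not say which special characters are handled or what the "explicit form" looks like. As a minimal sketch of the general idea only, the Python below replaces a few Unicode control and spacing characters that occur in Mongolian script with visible placeholder tokens before tokenization, and restores them afterwards. The character set, the placeholder names, and the functions `make_explicit` and `restore` are all assumptions made for illustration, not the authors' method.

```python
# Hypothetical mapping: the paper does not list its characters or target forms.
SPECIAL_CHARS = {
    "\u202f": "<NNBSP>",  # NARROW NO-BREAK SPACE, often written before Mongolian suffixes
    "\u180e": "<MVS>",    # MONGOLIAN VOWEL SEPARATOR
    "\u180b": "<FVS1>",   # MONGOLIAN FREE VARIATION SELECTOR ONE
    "\u180c": "<FVS2>",   # MONGOLIAN FREE VARIATION SELECTOR TWO
    "\u180d": "<FVS3>",   # MONGOLIAN FREE VARIATION SELECTOR THREE
}


def make_explicit(text: str) -> str:
    """Replace each special character with its visible placeholder token."""
    return "".join(SPECIAL_CHARS.get(ch, ch) for ch in text)


def restore(text: str) -> str:
    """Invert make_explicit by mapping placeholder tokens back to characters."""
    for ch, token in SPECIAL_CHARS.items():
        text = text.replace(token, ch)
    return text


if __name__ == "__main__":
    # Two Mongolian letters, a narrow no-break space, then one more letter.
    word = "\u1820\u1820\u202f\u1824"
    # Plain whitespace tokenization splits at U+202F and discards it.
    print(word.split())
    # With the placeholder in place, the word stays in one piece.
    print(make_explicit(word).split())
```

In this sketch, whitespace tokenization of the raw string yields two fragments and silently drops the narrow no-break space, while the explicit form survives as a single token from which the original text can be recovered with `restore`.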