Towards Integrated Classification Lexicon for Handling Unknown Words in Chinese-Vietnamese Neural Machine Translation
Wanjin Che, Zhengtao Yu, Zhiqiang Yu, Yonghua Wen, Junjun Guo
DOI: https://doi.org/10.1145/3373267
2020-05-31
Abstract: In Neural Machine Translation (NMT), vocabulary limitations prevent unknown words from being translated properly, which degrades the performance of the translation system. For resource-scarce NMT, where only a small-scale training corpus is available, the effect is amplified. The traditional approach of enlarging the corpus is not applicable, because parallel corpora are difficult to obtain in a resource-scarce setting; however, external knowledge, bilingual lexicons, and other resources are easy to obtain and use. Therefore, we propose a classification lexicon approach for processing unknown words in the Chinese-Vietnamese NMT task. Specifically, unknown Chinese-Vietnamese words are classified into three types, and the corresponding classification lexicons are constructed by word alignment, Wikipedia extraction, and rule-based methods, respectively. After translation, the unknown words are restored from the lexicons in a post-processing step. Experimental results on Chinese-Vietnamese, English-Vietnamese, and Mongolian-Chinese translation show that our approach significantly improves the accuracy and performance of NMT, especially in a resource-scarce setting.
computer science, artificial intelligence
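The post-processing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `<unk>` placeholder, the target-to-source `alignment` mapping, the lookup order across the three lexicons, the copy-the-source fallback, and all names and lexicon entries below are assumptions made for illustration only.

```python
import re
from typing import Callable, Dict, List, Optional

UNK = "<unk>"  # assumed placeholder emitted by the NMT system for unknown words

Lookup = Callable[[str], Optional[str]]


def rule_based_lookup(word: str) -> Optional[str]:
    """Hypothetical rule: a Chinese year such as '2020年' becomes 'năm 2020'."""
    match = re.fullmatch(r"(\d{4})年", word)
    return f"năm {match.group(1)}" if match else None


def restore_unknown_words(
    source_tokens: List[str],
    target_tokens: List[str],
    alignment: Dict[int, int],
    lexicons: List[Lookup],
) -> List[str]:
    """Replace each UNK in the translation using the aligned source word.

    `alignment` maps a target position to a source position (assumed to come
    from attention weights or an external aligner). Lexicons are tried in
    order; if none has an entry, the source word is copied unchanged.
    """
    restored = []
    for position, token in enumerate(target_tokens):
        source_index = alignment.get(position)
        if token != UNK or source_index is None:
            restored.append(token)  # known word, or no alignment to use
            continue
        source_word = source_tokens[source_index]
        replacement = None
        for lookup in lexicons:
            replacement = lookup(source_word)
            if replacement is not None:
                break
        restored.append(replacement if replacement is not None else source_word)
    return restored


# Illustrative entries only; a real lexicon would be built from data.
aligned_lexicon = {"河内": "Hà Nội"}   # stands in for a word-alignment lexicon
wiki_lexicon = {"越南": "Việt Nam"}    # stands in for a Wikipedia-derived lexicon
lexicons = [aligned_lexicon.get, wiki_lexicon.get, rule_based_lookup]

source = ["我", "住", "在", "河内"]
target = ["tôi", "sống", "ở", UNK]
print(" ".join(restore_unknown_words(source, target, {3: 3}, lexicons)))
# tôi sống ở Hà Nội
```

Keeping each lexicon behind the same lookup signature lets a dictionary-backed lexicon and a rule-based one be combined in a single ordered list, so the priority among the three sources can be changed without touching the restoration loop.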