Multi-task Learning-based Data Augmentation for Minority Languages to Chinese Neural Machine Translation

SHEN Yingli, ZHOU Maoke, ZHAO Xiaobing
DOI: https://doi.org/10.3969/j.issn.1003-0077.2023.02.010
2023-01-01
Abstract: Neural machine translation performs well on language pairs with large parallel corpora. To address the scarcity of bilingual parallel sentence pairs between minority languages and Chinese, this paper proposes incorporating data augmentation into a multi-task learning framework. First, simple transformations, such as word order adjustment and word substitution, are applied to the target sentence to produce new sentence pairs. Second, the resulting augmented pseudo-parallel corpora are introduced as auxiliary tasks in a multi-task learning framework to train the encoder more fully, making the network focus on generating richer and more accurate representations of the source-language sentences in the encoder. Experiments on the CCMT 2021 Mongolian-Chinese, Tibetan-Chinese, and Uyghur-Chinese datasets, and on the reverse directions, show consistent improvements over common data augmentation methods in machine translation.
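The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact transformations, rates, or how tasks are marked, so the function names (`reorder_words`, `substitute_words`, `build_multitask_corpus`), the local-shuffle window of 3, the substitution rate of 0.2, and the task labels are all assumptions chosen for illustration. Target sentences are assumed to be already segmented into whitespace-separated tokens.

```python
import random


def reorder_words(tokens, window, rng):
    """Word order adjustment (assumed form): shuffle tokens inside
    consecutive chunks of `window` tokens, so each token moves at most
    `window - 1` positions."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


def substitute_words(tokens, vocab, rate, rng):
    """Word substitution (assumed form): replace each token with a random
    vocabulary word with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]


def build_multitask_corpus(pairs, vocab, seed=0):
    """Return (task, source, target) triples: the original pair as the main
    task plus one pseudo-parallel pair per auxiliary task. The source side
    is left unchanged; only the target side is transformed."""
    rng = random.Random(seed)
    examples = []
    for src, tgt in pairs:
        tgt_tokens = tgt.split()
        examples.append(("translate", src, tgt))
        examples.append(
            ("reorder", src, " ".join(reorder_words(tgt_tokens, 3, rng)))
        )
        examples.append(
            ("substitute", src, " ".join(substitute_words(tgt_tokens, vocab, 0.2, rng)))
        )
    return examples
```

In a setup like this, the task label would typically be used to select a task-specific decoder or a task tag while all tasks share one encoder, which is how the auxiliary data would contribute to encoder training; the abstract does not state which of these mechanisms the paper uses.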