Pre-trained Word Embedding Based Parallel Text Augmentation Technique for Low-Resource NMT in Favor of Morphologically Rich Languages

Tulu Tilahun Hailu, Junqing Yu, Tessfu Geteye Fantaye
DOI: https://doi.org/10.1145/3331453.3361309
2019-01-01
Abstract: Neural machine translation (NMT) has recently achieved remarkable results. However, its performance depends heavily on the size of the parallel training text, and low-resource languages lack the required amount. The problem is exacerbated when at least one language of the pair is morphologically rich. Inspired by the capabilities of pre-trained word embedding models, we propose a simple yet effective word-embedding-based parallel text augmentation technique that enlarges limited parallel text for low-resource NMT. To this end, we adopt publicly available pre-trained word embedding models that favor morphologically rich languages and replace each word in the original text with its n most similar words in the vector space (n = 3 in this study). For experimental analysis, we simulate low-resource conditions using publicly available parallel texts for the German↔English, Turkish↔English and Finnish↔English language pairs. We use the original parallel texts both as seeds for generating synthetic parallel texts and for training character-level NMT models, which serve as baselines. The results show that our augmentation technique improves on the baseline and dummy-source-sentence-based NMT models by up to 5.9 CHRF points.
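The replacement step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy vectors in `TOY_EMBEDDINGS`, the function names `most_similar` and `augment`, the use of cosine similarity, and the choice to build the k-th synthetic sentence from each word's k-th nearest neighbour are all assumptions made for illustration. Words missing from the vocabulary are left unchanged, which is also an assumption. The sketch covers one side of a sentence pair only, since the abstract does not say how synthetic sentences are paired with the other side.

```python
import math

# Hypothetical toy vectors standing in for a real pre-trained embedding model.
TOY_EMBEDDINGS = {
    "house": (1.0, 0.0),
    "home": (0.9, 0.1),
    "building": (0.8, 0.3),
    "hut": (0.7, 0.5),
    "red": (0.0, 1.0),
    "crimson": (0.1, 0.9),
    "scarlet": (0.3, 0.8),
    "pink": (0.5, 0.7),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, embeddings, n=3):
    """Return up to n vocabulary words closest to `word`, best first."""
    if word not in embeddings:
        return []
    query = embeddings[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:n]]


def augment(sentence, embeddings, n=3):
    """Build n synthetic sentences (assumed scheme, not from the paper).

    The k-th sentence replaces every word with its k-th nearest neighbour.
    Words without an embedding or without enough neighbours are kept as-is.
    """
    tokens = sentence.split()
    neighbours = [most_similar(tok, embeddings, n) for tok in tokens]
    return [
        " ".join(
            nb[k] if k < len(nb) else tok
            for tok, nb in zip(tokens, neighbours)
        )
        for k in range(n)
    ]


if __name__ == "__main__":
    for synthetic in augment("the red house", TOY_EMBEDDINGS, n=3):
        print(synthetic)
```

In practice the toy dictionary would be replaced by lookups against a pre-trained model, and the brute-force neighbour search by the model's own similarity query.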