Abstract: Subword embeddings have proved to be a major innovation in Neural Machine Translation (NMT). They help NMT models learn better context vectors for Low Resource Languages (LRLs), so that target words are predicted by better modelling the morphologies of the two languages as well as the morphosyntactic transfer between them. Even so, performance on Indian-language-to-Indian-language translation is still not as good as for resource-rich languages. One reason is the relative morphological richness of Indian languages; another is that most of them fall into the extremely low-resource or zero-shot categories. Since most major Indian languages use Indic scripts of Brahmi origin, text written in them is highly phonetic and is phonetically similar across languages in terms of abstract letters and their arrangement. We use these characteristics of Indian languages and their scripts to propose an approach based on a common multilingual Latin-based encoding (WX notation) that takes advantage of language similarity while addressing the morphological complexity issue in NMT. This multilingual Latin-based encoding, together with Byte Pair Encoding (BPE), allows us to better exploit phonetic, orthographic, and lexical similarities and to improve translation quality by projecting different but similar languages onto the same orthographic-phonetic character space. We verify the proposed approach through experiments on similar language pairs (Gujarati-Hindi, Marathi-Hindi, Nepali-Hindi, Maithili-Hindi, Punjabi-Hindi, and Urdu-Hindi) under low-resource conditions. The proposed approach shows an improvement in the majority of cases, in one case by as much as ~10 BLEU points over baseline techniques for similar language pairs. We also obtain improvements of up to ~1 BLEU point on distant and zero-shot language pairs.
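The idea of projecting similar languages onto one character space can be illustrated with a minimal sketch. It is not the authors' implementation: the function `to_common`, the table names, and the small letter subset are hypothetical, the WX-style letters cover only a few characters, and the handling of the inherent vowel is simplified. The sketch relies on the fact that the Unicode Devanagari and Gujarati blocks share an ISCII-derived layout, so the same offset within a block usually denotes the same abstract letter.

```python
# Illustrative sketch only: a partial, simplified WX-style mapping.
# Table contents and function names are assumptions, not the paper's code.

# Start of each Unicode block (the blocks share an ISCII-derived layout).
BLOCK_BASE = {"deva": 0x0900, "gujr": 0x0A80}

# Offsets within a block mapped to WX-style Latin letters (small subset).
CONSONANTS = {0x15: "k", 0x17: "g", 0x1C: "j", 0x24: "w", 0x26: "x",
              0x28: "n", 0x2A: "p", 0x2E: "m", 0x30: "r", 0x32: "l",
              0x38: "s", 0x39: "h"}
MATRAS = {0x3E: "A", 0x3F: "i", 0x40: "I", 0x41: "u", 0x42: "U",
          0x47: "e", 0x4B: "o"}
VOWELS = {0x05: "a", 0x06: "A", 0x07: "i", 0x08: "I", 0x09: "u",
          0x0A: "U", 0x0F: "e", 0x13: "o"}
VIRAMA = 0x4D  # suppresses the inherent vowel of the preceding consonant


def to_common(text, script):
    """Project text in the given script onto a shared Latin encoding."""
    base = BLOCK_BASE[script]
    out = []
    pending_a = False  # a consonant was emitted, inherent 'a' not yet resolved
    for ch in text:
        off = ord(ch) - base
        if off in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[off])
            pending_a = True
        elif off in MATRAS:
            out.append(MATRAS[off])  # vowel sign replaces the inherent 'a'
            pending_a = False
        elif off == VIRAMA:
            pending_a = False
        else:
            if pending_a:
                out.append("a")
            # Independent vowel if known, otherwise pass the character through.
            out.append(VOWELS.get(off, ch))
            pending_a = False
    if pending_a:
        out.append("a")
    return "".join(out)


# The word for "name" in Hindi (Devanagari) and Gujarati script.
hindi = "\u0928\u093E\u092E"
gujarati = "\u0AA8\u0ABE\u0AAE"

hindi_common = to_common(hindi, "deva")
gujarati_common = to_common(gujarati, "gujr")
print(hindi_common, gujarati_common)
```

In their native scripts the two words share no characters, but after projection they become the same string, so a subword vocabulary learned jointly over the encoded text of both languages can share units between them.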

Sub-word Embedding Auxiliary Encoding in Mongolian-Chinese Neural Machine Translation

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

Finding Better Subword Segmentation for Neural Machine Translation

Research on the LSTM Mongolian and Chinese machine translation based on morpheme encoding

Adversarial Training for Unknown Word Problems in Neural Machine Translation

Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

Neural Machine Translation of Logographic Languages Using Sub-character Level Information

Shared-Private Bilingual Word Embeddings for Neural Machine Translation

Machine Translation by Projecting Text into the Same Phonetic-Orthographic Space Using a Common Encoding

Character-Aware Decoder for Translation into Morphologically Rich Languages

Enhancing Neural Machine Translation with Semantic Units

Local Byte Fusion for Neural Machine Translation

A Deep Investigation of RNN and Self-attention for the Cyrillic-Traditional Mongolian Bidirectional Conversion

Improving Character-based Decoding Using Target-Side Morphological Information for Neural Machine Translation

Improving Neural Machine Translation Model with Deep Encoding Information

Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level

Multi-channel Encoder for Neural Machine Translation

Memory-augmented Chinese-Uyghur neural machine translation

When is Char Better Than Subword: A Systematic Study of Segmentation Algorithms for Neural Machine Translation

Multimodal Machine Translation with Embedding Prediction