Abstract: Recently, universal neural machine translation (NMT) with a shared encoder-decoder has achieved good performance on zero-shot translation. Unlike universal NMT, jointly trained language-specific encoders-decoders aim to achieve a universal representation across non-shared modules, each of which serves a language or language family. The non-shared architecture has the advantage of mitigating internal language competition, especially when the shared vocabulary and model parameters are restricted in size. However, the zero-shot translation performance of multiple encoders and decoders still lags behind universal NMT. In this work, we study zero-shot translation using language-specific encoders-decoders. We propose to generalize the non-shared architecture and universal NMT by differentiating the Transformer layers into language-specific and interlingua layers. By selectively sharing parameters and applying cross-attentions, we explore maximizing the universality of the representation and realizing the best alignment of language-agnostic information. We also introduce a denoising auto-encoding (DAE) objective to train the model jointly with the translation task in a multi-task manner. Experiments on two public multilingual parallel datasets show that our proposed model achieves competitive or better results than universal NMT and a strong pivot baseline. Moreover, we experiment with incrementally adding a new language to the trained model by updating only the new model parameters. With this little effort, zero-shot translation between the newly added language and the existing languages achieves results comparable to those of a model trained jointly from scratch on all languages.
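The split between language-specific and interlingua layers can be pictured as a simple parameter-ownership plan. The sketch below is illustrative only and is not the authors' implementation: the function names, the key naming scheme, and the choice to place the shared layers at the top of the encoder stack are all assumptions made for this example.

```python
from typing import Dict, List, Set


def encoder_layer_plan(
    languages: List[str], n_layers: int, n_shared: int
) -> Dict[str, List[str]]:
    """Map each language to the parameter keys of its encoder stack.

    Assumption (not stated in the abstract): the lower layers are
    language-specific and the top `n_shared` layers are the shared
    interlingua layers.
    """
    if not 0 <= n_shared <= n_layers:
        raise ValueError("n_shared must be between 0 and n_layers")
    n_specific = n_layers - n_shared
    plan: Dict[str, List[str]] = {}
    for lang in languages:
        # Layers owned by this language only.
        specific = [f"enc.{lang}.{i}" for i in range(n_specific)]
        # Layers whose keys are identical for every language.
        shared = [f"enc.shared.{i}" for i in range(n_specific, n_layers)]
        plan[lang] = specific + shared
    return plan


def trainable_keys_for_new_language(
    plan: Dict[str, List[str]], new_lang: str, n_layers: int, n_shared: int
) -> Set[str]:
    """Keys to update when a language is added: only its own new layers."""
    existing = {key for keys in plan.values() for key in keys}
    new_keys = encoder_layer_plan([new_lang], n_layers, n_shared)[new_lang]
    return {key for key in new_keys if key not in existing}
```

Under these assumptions, `n_shared=0` corresponds to the fully non-shared architecture and `n_shared=n_layers` to a universal shared encoder, with intermediate values interpolating between the two. Calling `trainable_keys_for_new_language` on an existing plan returns only the new language's own layer keys, mirroring the incremental setting in which existing parameters stay fixed.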

Joint-training on Symbiosis Networks for Deep Neural Machine Translation models

Multiscale Collaborative Deep Models for Neural Machine Translation

Joint Training for Neural Machine Translation Models with Monolingual Data

Shallow-to-Deep Training for Neural Machine Translation

Learning Deep Transformer Models For Machine Translation

Better Simultaneous Translation with Monotonic Knowledge Distillation

Enhanced Neural Machine Translation by Joint Decoding with Word and POS-tagging Sequences

Deep Transformer Modeling Via Grouping Skip Connection for Neural Machine Translation

Very Deep Transformers for Neural Machine Translation

Improving Zero-shot Neural Machine Translation on Language-specific Encoders-Decoders

Deep Fusing Pre-trained Models into Neural Machine Translation

Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

Multi-channel Encoder for Neural Machine Translation

Improving Neural Machine Translation Model with Deep Encoding Information

Layer-Wise Coordination Between Encoder and Decoder for Neural Machine Translation

Multi-Task Learning with Shared Encoder for Non-Autoregressive Machine Translation

Depth Growing for Neural Machine Translation

Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder

Training Deeper Neural Machine Translation Models with Transparent Attention

Neural System Combination For Machine Translation