Abstract: Mixture-of-Experts (MoE) based sparse architectures can significantly increase model capacity with sublinear computational overhead and are hence widely used in massively multilingual neural machine translation (MNMT). However, they are prone to overfitting on low-resource language translation. In this paper, we propose a modularized MNMT framework that flexibly assembles dense and MoE-based sparse modules to achieve the best of both worlds. The training strategy of the modularized MNMT framework consists of three stages: (1) pre-training basic MNMT models with different training objectives or model structures, (2) initializing the modules of the framework with their pre-trained counterparts (e.g., encoder, decoder and embedding layers) from the basic models and (3) fine-tuning the modularized MNMT framework so that modules from different models fit together. We pre-train three basic MNMT models from scratch: a dense model, an MoE-based sparse model and a new MoE model, termed MoE-LGR, that uses multiple Language-Group-specific Routers to incorporate language group knowledge into MNMT. Each of these pre-trained models is strong in a different setting: low-resource language translation, high-resource language translation or zero-shot translation. Our modularized MNMT framework attempts to incorporate these advantages into a single model through reasonable initialization and fine-tuning. Experiments on widely used benchmark datasets demonstrate that the proposed modularized MNMT framework substantially outperforms both MoE and dense models on high- and low-resource language translation as well as zero-shot translation. Our framework facilitates the combination of different methods with their own strengths and the recycling of off-the-shelf models for multilingual neural machine translation. Code is available at https://github.com/lishangjie1/MMNMT.
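
Stage (2) of the training strategy, initializing each module from a different pre-trained model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assemble_modules`, the parameter names, the toy weights and the particular assignment of modules to source models are all hypothetical, and plain Python lists stand in for tensors. It only shows the general idea of copying parameters by module prefix before fine-tuning.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened weights.
Checkpoint = Dict[str, List[float]]


def assemble_modules(sources: Dict[str, Checkpoint], plan: Dict[str, str]) -> Checkpoint:
    """Build one initial checkpoint by taking each module from a chosen source model.

    `plan` maps a module prefix (e.g. "encoder") to the name of the source
    model in `sources` that should supply that module's parameters.
    """
    assembled: Checkpoint = {}
    for prefix, model_name in plan.items():
        checkpoint = sources[model_name]
        # Copy every parameter that belongs to this module.
        picked = {
            name: list(weights)
            for name, weights in checkpoint.items()
            if name == prefix or name.startswith(prefix + ".")
        }
        if not picked:
            raise KeyError(f"model {model_name!r} has no parameters under {prefix!r}")
        assembled.update(picked)
    return assembled


# Hypothetical pre-trained models (names and values are illustrative only).
dense = {
    "embedding.weight": [0.1],
    "encoder.layer0.ffn": [0.2],
    "decoder.layer0.ffn": [0.3],
}
moe = {
    "embedding.weight": [1.1],
    "encoder.layer0.experts": [1.2, 1.3],
    "decoder.layer0.experts": [1.4, 1.5],
}

# Hypothetical assembly: sparse encoder, dense decoder and embeddings.
plan = {"embedding": "dense", "encoder": "moe", "decoder": "dense"}
assembled = assemble_modules({"dense": dense, "moe": moe}, plan)
print(sorted(assembled))
```

The assembled checkpoint would then serve as the starting point for stage (3), where fine-tuning adapts modules that were trained separately to work together.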

Multi-Layer Softmaxing during Training Neural Machine Translation for Flexible Decoding with Fewer Layers

Balancing Cost and Benefit with Tied-Multi Transformers

Training Flexible Depth Model by Multi-Task Learning for Neural Machine Translation

Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

Recurrent Stacking of Layers for Compact Neural Machine Translation Models

Layer-Wise Multi-View Learning for Neural Machine Translation

Exploiting deep representations for neural machine translation

Coarse-to-Fine Output Predictions for Efficient Decoding in Neural Machine Translation

Multiscale Collaborative Deep Models for Neural Machine Translation

Adaptive Multi-pass Decoder for Neural Machine Translation

MMNMT: Modularizing Multilingual Neural Machine Translation with Flexibly Assembled MoE and Dense Blocks

Parallelizing and Optimizing Neural Encoder–Decoder Models Without Padding on Multi-Core Architecture

Multi-channel Encoder for Neural Machine Translation

What Works and Doesn’t Work, A Deep Decoder for Neural Machine Translation

Latent Attribute Based Hierarchical Decoder for Neural Machine Translation

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation

Learning Language-Specific Layers for Multilingual Machine Translation

Very Deep Transformers for Neural Machine Translation

Transformer with Layer Fusion and Interaction

Efficient Context-Aware Neural Machine Translation with Layer-Wise Weighting and Input-Aware Gating

Chunk-Based Bi-Scale Decoder for Neural Machine Translation