Abstract: Most deep neural machine translation (NMT) models are built in a bottom-up feedforward fashion, in which representations in lower layers construct or modulate the representations in higher layers. We conjecture that this unidirectional encoding could be a potential issue when building a deep NMT model. In this paper, we propose to build a deeper Transformer encoder by organizing the encoder layers into multiple groups, which are connected via a grouping skip connection mechanism: the output of each group is fed into subsequent groups. In this way, we successfully build a deep Transformer encoder with up to 48 layers. Moreover, we can share parameters among groups to extend the (virtual) depth of the encoder without introducing additional parameters. Experiments on the large-scale WMT (Workshop on Machine Translation) 2014 English-to-German and English-to-French, WMT 2016 English-to-German, and WMT 2017 Chinese-to-English translation tasks demonstrate that our proposed deep Transformer model significantly outperforms the strong Transformer baseline. Furthermore, we carry out linguistic probing tasks to analyze the problems in the original Transformer model and to explain how our deep Transformer encoder improves translation quality. A particularly nice property of our approach is that it is easy to implement. Our code is available on GitHub: https://github.com/liyc7711/deep-nmt
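
The grouping idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how a group's output is combined with earlier outputs, so the averaging used here is purely an assumption, and `make_toy_layer`, `run_group`, and `grouped_encoder` are hypothetical names that do not come from the authors' code. The toy layer is a stand-in for a real Transformer encoder layer.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for one encoder layer (simple residual update)."""
    def layer(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return layer


def run_group(group: Sequence[Layer], x: Vector) -> Vector:
    """Apply the layers of one group in order."""
    for layer in group:
        x = layer(x)
    return x


def grouped_encoder(groups: Sequence[Sequence[Layer]], x: Vector) -> Vector:
    """Feed each group a fusion of the input and all earlier group outputs.

    Assumption: the fusion is an element-wise mean; the abstract does not
    say which fusion the grouping skip connection actually uses.
    """
    history = [x]
    for group in groups:
        fused = [sum(col) / len(history) for col in zip(*history)]
        history.append(run_group(group, fused))
    return history[-1]


# 48 layers organised as 8 groups of 6 distinct layers each.
distinct_groups = [[make_toy_layer(0.01) for _ in range(6)] for _ in range(8)]

# Parameter sharing: a single group of 6 layers reused 8 times, giving a
# virtual depth of 48 with only 6 layers' worth of parameters.
shared_group = [make_toy_layer(0.01) for _ in range(6)]
shared_groups = [shared_group] * 8

out_distinct = grouped_encoder(distinct_groups, [1.0, 2.0])
out_shared = grouped_encoder(shared_groups, [1.0, 2.0])
```

The group size of 6 and the count of 8 are illustrative choices that happen to multiply to the 48 layers mentioned above; the abstract does not state how the layers are partitioned.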

Evolving transformer architecture for neural machine translation

Genetic Algorithm-based Transformer Architecture Design for Neural Machine Translation

Transformer with Layer Fusion and Interaction

AutoTrans: Automating Transformer Design via Reinforced Architecture Search

The Evolved Transformer

Deep Transformers with Latent Depth

Searching Better Architectures for Neural Machine Translation

Optimizing the Structures of Transformer Neural Networks Using Parallel Simulated Annealing

Understanding the Difficulty of Training Transformers

Deep Transformer Modeling Via Grouping Skip Connection for Neural Machine Translation

GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation

Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

Enhancing Transformer with Horizontal and Vertical Guiding Mechanisms for Neural Language Modeling

Transformer-Dw: A Transformer Network With Dynamic And Weighted Head

Learning Deep Transformer Models For Machine Translation

Gated Residual Connection for Neural Machine Translation

Transformer: A General Framework from Machine Translation to Others

An Augmented Transformer Architecture for Natural Language Generation Tasks

Towards Building a Strong Transformer Neural Machine Translation System

Dynamic Past and Future for Neural Machine Translation

Rethinking the Value of Transformer Components