Abstract: Transformer networks have become the preferred architecture for many tasks due to their state-of-the-art performance. However, the optimal way to implement residual connections in Transformers, which are essential for effective training, is still debated. Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN) Transformers, which apply layer normalization after each residual block's output or before each residual block's input, respectively. While both variants have their advantages, they also suffer from severe limitations: Post-LN causes a gradient vanishing issue that hinders training deep Transformers, and Pre-LN causes a representation collapse issue that limits model capacity. In this paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections in Post-LN and Pre-LN together and inherits their advantages while avoiding their limitations. We conduct both theoretical analyses and empirical experiments to verify the effectiveness of ResiDual. Theoretically, we prove that ResiDual has a lower bound on the gradient to avoid the vanishing issue due to the residual connection from Pre-LN. Moreover, ResiDual also has diverse model representations to avoid the collapse issue due to the residual connection from Post-LN. Empirically, ResiDual outperforms both Post-LN and Pre-LN on several machine translation benchmarks across different network depths and data sizes. Thanks to the good theoretical and empirical performance, ResiDual Transformer can serve as a foundation architecture for different AI models (e.g., large language models). Our code is available at https://github.com/microsoft/ResiDual.

What problem does this paper attempt to address?

The paper addresses how residual connections should be designed in the Transformer, and in particular the gradient vanishing and representation collapse problems that affect the Post-LN (Post-Layer-Normalization) and Pre-LN (Pre-Layer-Normalization) variants. Specifically:

1. **Gradient vanishing problem**: In the Post-LN architecture, layer normalization (LN) is applied after the output of each residual block. As a result, the gradient decays with depth and almost disappears in the lower layers, which makes deep Transformer models difficult to train.
2. **Representation collapse problem**: In the Pre-LN architecture, the gradient does not vanish, but because layer normalization is applied before the input of each residual block, the hidden representations of the higher blocks tend to become similar, which limits the capacity of the model.

To overcome these problems, the paper proposes a new Transformer architecture, ResiDual, which combines the advantages of Post-LN and Pre-LN while avoiding their disadvantages. ResiDual uses the Pre-Post-LN (PPLN) method: each residual block uses both types of residual connection at the same time, which prevents gradient vanishing and maintains representation diversity.

### Main contributions

1. **Proposing the ResiDual architecture**: The architecture addresses gradient vanishing and representation collapse by combining the advantages of Post-LN and Pre-LN.
2. **Theoretical analysis**: The paper proves that ResiDual has a lower bound on the gradient norm, which avoids the gradient vanishing problem, and that its hidden representation diversity is the same as that of Post-LN, which avoids the representation collapse problem.
3. **Experimental verification**: Experiments on multiple machine translation tasks show that ResiDual outperforms Post-LN and Pre-LN across different network depths and data scales.

### Method overview

The main design of the ResiDual architecture is as follows:

- **Post-LN-style residual connection**: Layer normalization is applied after the output of each residual block, as in the traditional Post-LN.
- **Pre-LN-style residual connection**: Layer normalization is applied before the input of each residual block, as in the traditional Pre-LN.
- **Final output**: The results of the two residual connections are added together to form the final output.

### Theoretical analysis

1. **Gradient vanishing problem**: The paper proves that the ResiDual architecture has a lower bound on the gradient norm, so the gradient does not vanish.
2. **Representation collapse problem**: By analyzing how the hidden representations change across layers, the paper proves that the hidden representation diversity of ResiDual is the same as that of Post-LN, so the representations do not collapse.

### Experimental results

The paper reports experiments on multiple machine translation tasks, covering small-scale (IWSLT), medium-scale (WMT), and large-scale (OPUS-100) data sets. The results show that ResiDual outperforms Post-LN and Pre-LN on all data sets, particularly for deep models.

### Conclusion

By combining the advantages of Post-LN and Pre-LN, the ResiDual architecture addresses both gradient vanishing and representation collapse, providing a more effective base architecture for the Transformer model.
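
The method overview above can be illustrated with a minimal sketch in plain Python. It rests on several assumptions that are not stated in this summary: the two-stream update (a Post-LN stream plus an un-normalized accumulator that is normalized once at the end) follows our reading of the ResiDual paper, `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer. It is not the authors' implementation; see their repository for that.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]


def toy_sublayer(x, k):
    """Hypothetical stand-in for the attention/FFN sublayer of layer k."""
    n = len(x)
    return [math.tanh(x[(i + 1) % n] * (0.5 + 0.1 * k)) for i in range(n)]


def post_ln(x, num_layers):
    """Post-LN: normalize after each residual addition."""
    for k in range(num_layers):
        x = layer_norm(add(x, toy_sublayer(x, k)))
    return x


def pre_ln(x, num_layers):
    """Pre-LN: normalize the sublayer input, keep the residual stream raw."""
    for k in range(num_layers):
        x = add(x, toy_sublayer(layer_norm(x), k))
    return layer_norm(x)  # final normalization on the output


def residual_dual(x, num_layers):
    """ResiDual-style sketch (assumed form): two residual streams, summed at the end."""
    x_ln = x    # Post-LN-style stream, normalized after every block
    x_dual = x  # Pre-LN-style stream, accumulates raw sublayer outputs
    for k in range(num_layers):
        out = toy_sublayer(x_ln, k)
        x_dual = add(x_dual, out)
        x_ln = layer_norm(add(x_ln, out))
    return add(x_ln, layer_norm(x_dual))


if __name__ == "__main__":
    x0 = [1.0, -2.0, 0.5, 3.0]
    for name, fn in [("post_ln", post_ln), ("pre_ln", pre_ln), ("residual_dual", residual_dual)]:
        print(name, [round(v, 4) for v in fn(x0, 6)])
```

In this sketch every sublayer reads the normalized stream `x_ln`, while `x_dual` gives each sublayer output a direct additive path to the final output, which is the intuition behind the gradient lower bound described above.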

ResiDual: Transformer with Dual Residual Connections

Gated Residual Connection for Neural Machine Translation

ResiDual Transformer Alignment with Spectral Decomposition

Rewiring the Transformer with Depth-Wise LSTMs

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

Dual-stream Network for Visual Recognition

Multi-split Reversible Transformers Can Enhance Neural Machine Translation

Unified Normalization for Accelerating and Stabilizing Transformers

Transformer with Layer Fusion and Interaction

R-Transformer: Recurrent Neural Network Enhanced Transformer

Hyper-Connections

Deep Transformers with Latent Depth

Transformer in Transformer As Backbone for Deep Reinforcement Learning

Retentive Network: A Successor to Transformer for Large Language Models

ResDNet: Efficient Dense Multi-Scale Representations with Residual Learning for High-Level Vision Tasks

Dual-resolution Transformer Combined with Multi-Layer Separable Convolution Fusion Network for Real-Time Semantic Segmentation

A Residual Network with Efficient Transformer for Lightweight Image Super-Resolution

ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers

RWKV: Reinventing RNNs for the Transformer Era

Adaptive Split-Fusion Transformer