Effective Theory of Transformers at Initialization

Emily Dinan, Sho Yaida, Susan Zhang
2023-04-05
Abstract: We perform an effective-theory analysis of forward-backward signal propagation in wide and deep Transformers, i.e., residual neural networks with multi-head self-attention blocks and multilayer perceptron blocks. This analysis suggests particular width scalings of initialization and training hyperparameters for these models. We then take up such suggestions, training Vision and Language Transformers in practical setups.
Machine Learning, Computation and Language, High Energy Physics - Theory
What problem does this paper attempt to address?
The paper gives a theoretical analysis of signal propagation in wide and deep Transformers at initialization, i.e., before any parameters have been trained. Specifically, the authors perform an effective-theory analysis of forward and backward signal propagation through the multi-head self-attention blocks and multilayer perceptron blocks of these residual networks.

### Research Objectives

- **Initialization Hyperparameter Width Scaling**: Determine how initialization hyperparameters should scale with model width (e.g., the embedding dimension) so that signals keep an appropriate scale across the model's layers.
- **Optimizer Learning Rate Factor Width Scaling**: Recommend how the learning rate factors of optimizers such as stochastic gradient descent (SGD) and AdamW should scale with model width.

### Research Content

1. **Statistics of Pre-Activation Values**: Analyze the statistics of pre-activation values at initialization and, from them, propose width scalings for the initialization hyperparameters.
2. **Neural Tangent Kernel Statistics**: Briefly introduce the neural tangent kernel, which describes how small changes to the weights at initialization change the model's output.
3. **Squared Gradient Statistics**: Compute the mean of squared gradients at initialization to derive width scalings for the optimizer learning rate factors.

### Methodology

- **Theoretical Foundation**: Review the basic building blocks of Transformers (input embeddings, layer normalization, multi-head self-attention, and multilayer perceptrons) and establish a mathematical representation of each.
- **Statistical Analysis**: Analyze the statistics of each component to ensure that the signal stays stable during forward propagation.
- **Role of Normalization Layers**: Analyze how layer normalization helps prevent signals from exploding or vanishing.
- **Self-Attention Mechanism Analysis**: Discuss in detail how multi-head self-attention works and what its statistics are.

### Practical Applications

- **Image Classification**: Use encoder-only Transformer models for image classification tasks.
- **Text Processing**: Use Transformer models with an encoder-decoder architecture for text processing tasks, such as text generation or translation.

### Conclusion

From this theoretical analysis and its practical applications, the authors propose width scalings for initialization hyperparameters and optimizer learning rate factors, which help improve the training efficiency and performance of wide and deep Transformer models.
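The width scaling of initialization hyperparameters and the role of layer normalization described above can be illustrated with a small numerical sketch. It is not the paper's derivation. It assumes a single bias-free linear layer with i.i.d. Gaussian weights, a hypothetical order-one constant `c_w`, and illustrative helper names (`mean_square`, `dense_layer`, `layer_norm`). It compares weight variance `c_w / width` with an unscaled variance `c_w`, then applies layer normalization without learnable parameters to the unscaled output.

```python
import math
import random


def mean_square(vec):
    """Average squared component: a per-neuron measure of signal size."""
    return sum(v * v for v in vec) / len(vec)


def dense_layer(x, n_out, weight_variance, rng):
    """Bias-free linear layer with i.i.d. zero-mean Gaussian weights."""
    std = math.sqrt(weight_variance)
    return [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(n_out)]


def layer_norm(vec, eps=1e-5):
    """Layer normalization without learnable gain or bias."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


rng = random.Random(0)  # fixed seed so the run is repeatable
c_w = 1.0               # hypothetical order-one initialization hyperparameter
results = {}
for width in (128, 512):
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    z_scaled = dense_layer(x, width, c_w / width, rng)  # variance ~ 1/width
    z_unscaled = dense_layer(x, width, c_w, rng)        # no width scaling
    results[width] = {
        # Ratio of output to input signal size for each initialization.
        "scaled": mean_square(z_scaled) / mean_square(x),
        "unscaled": mean_square(z_unscaled) / mean_square(x),
        # Signal size after normalizing the unscaled output.
        "unscaled_after_layer_norm": mean_square(layer_norm(z_unscaled)),
    }
    print(width, results[width])
```

Under these assumptions, the `scaled` ratio should stay close to `c_w` at both widths, while the `unscaled` ratio should grow roughly in proportion to the width. Layer normalization brings the mean square of the unscaled output back to about 1, which is the sense in which normalization guards against exploding or vanishing signals in this toy setting.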