Abstract: Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration, we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.

What problem does this paper attempt to address?

This paper addresses training instabilities that arise in large-scale Transformer models but do not appear when the same hyperparameters are used at smaller scales. The authors introduce a metric called learning rate sensitivity (LR sensitivity), which summarizes how strongly the final loss depends on the learning rate, and they measure the relationship between learning rate and loss across model scales. They find that two known instabilities, the growth of logits in attention layers and the divergence of the output logits from the log probabilities, also occur in small models trained at high learning rates. The mitigations previously used at large scale, qk-layernorm and z-loss regularization, are equally effective in small models. The authors also investigate how other optimizer and model interventions, such as warm-up, weight decay, and μParam, affect LR sensitivity. They find that these techniques can change sensitivity within the range of learning rates that train stably, but do little to widen that range. They also observe that increasing model depth raises LR sensitivity faster than increasing width, although in the largest-scale experiments, scaling depth alone can achieve lower loss. Finally, the paper shows that some instabilities can be predicted before they emerge by examining how activation and gradient norms scale. For example, tracking the growth of attention logits across model sizes indicates when the instability is likely to appear in larger models. The authors also point out that the default epsilon hyperparameter of the AdamW optimizer can be too large, which makes updates too small, and they relate this effect, together with attention logit growth, to parameter norm growth.
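
To make the LR sensitivity idea concrete, here is a minimal Python sketch. It assumes one plausible definition that the summary above does not spell out: the average amount by which the final loss exceeds the best final loss across a learning rate sweep, with diverged runs capped at the loss at initialization. The function name `lr_sensitivity`, the argument `loss_at_init`, and the sweep values are hypothetical and are not taken from the paper.

```python
import math


def lr_sensitivity(final_losses, loss_at_init):
    """Mean excess of final loss over the best run in a learning-rate sweep.

    final_losses: one final loss per learning rate in the sweep
        (typically log-spaced); diverged runs may be nan or inf.
    loss_at_init: loss of the untrained model, used to cap diverged runs
        (an assumption of this sketch).
    """
    capped = [
        loss_at_init if not math.isfinite(loss) else min(loss, loss_at_init)
        for loss in final_losses
    ]
    best = min(capped)
    return sum(loss - best for loss in capped) / len(capped)


# Hypothetical sweep: three stable runs and one diverged run (made-up values).
sweep = [3.2, 3.0, 3.1, float("nan")]
print(lr_sensitivity(sweep, loss_at_init=10.0))
```

Under this definition, a model whose loss is flat across the sweep scores close to zero, while a single diverged run raises the score sharply, so lower values indicate training that is less sensitive to the learning rate.
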
In summary, the goal of the paper is to reproduce and study instability in small-scale models, providing new scientific opportunities for studying training stability without requiring a large amount of resources.
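
The two mitigations named in the summary can be illustrated with a short pure-Python sketch. It is not the paper's implementation: the layer norm here has no learned scale or bias, the z-loss coefficient of 1e-4 is an assumed default, and all function names and input vectors are hypothetical. The z-loss penalizes the squared log of the softmax normalizer Z, which keeps the output logits close to the log probabilities, while qk-layernorm normalizes queries and keys before their dot product so that attention logits cannot grow with the norms of those vectors.

```python
import math


def log_z(logits):
    """Numerically stable log of the softmax normalizer Z = sum(exp(logits))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * log(Z)^2, which pulls log Z toward 0."""
    return coeff * log_z(logits) ** 2


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def attention_logit(q, k, qk_layernorm=False):
    """Scaled dot-product attention logit for one query/key pair."""
    if qk_layernorm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Hypothetical large-norm query and key vectors (made-up values).
q = [40.0, -10.0, 25.0, 5.0]
k = [30.0, -20.0, 15.0, 0.0]
print(attention_logit(q, k))                     # grows with the norms of q and k
print(attention_logit(q, k, qk_layernorm=True))  # magnitude at most sqrt(len(q))
print(z_loss([12.0, 11.0, 9.0]))                 # larger when log Z is far from 0
```

With normalized inputs, each vector has norm at most sqrt(d), so the scaled logit is bounded by sqrt(d) no matter how large the raw queries and keys become. Shifting all output logits by a constant leaves the softmax unchanged but changes log Z, which is the drift the z-loss term discourages.
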

Small-scale proxies for large-scale Transformer training instabilities

Methods of improving LLM training stability

Understanding the Difficulty of Training Transformers

Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Tending Towards Stability: Convergence Challenges in Small Language Models

Staged Training for Transformer Language Models

Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Scaling Exponents Across Parameterizations and Optimizers

Measuring and Mitigating Local Instability in Deep Neural Networks

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

A Theory on Adam Instability in Large-Scale Machine Learning

Warmstarting for Scaling Language Models

Strong Model Collapse

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Unraveling the Mystery of Scaling Laws: Part I

A Dynamical Model of Neural Scaling Laws

Scaling ResNets in the Large-depth Regime

Global Convergence in Training Large-Scale Transformers

Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes