Small-scale proxies for large-scale Transformer training instabilities

Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
2023-10-17
Abstract: Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.
Machine Learning
What problem does this paper attempt to address?
This paper addresses training instabilities that appear in large-scale Transformer models but are not evident when the same hyperparameters are used at small scale. The authors introduce a measure called learning rate (LR) sensitivity, which summarizes how much the final loss changes as the learning rate is varied, and they use it to compare the relationship between learning rate and loss in models of different scales. They find that two known instabilities, the growth of logits in attention layers and the divergence of the output logits from the log probabilities, also occur in small models trained at high learning rates. The mitigations previously used at large scale, qk-layernorm and z-loss regularization, are equally effective in small models.

The authors also study how other optimizer and model interventions, such as warm-up, weight decay, and μParam, affect LR sensitivity. These techniques can change the sensitivity within a given range of learning rates, but they have relatively little effect on the range of learning rates at which training is stable. The authors further observe that increasing model depth raises LR sensitivity faster than increasing width, although at the largest scale tested, increasing depth alone can achieve a lower loss.

Finally, the paper shows that some instabilities can be predicted before they emerge by examining how model activation and gradient norms scale. For example, monitoring the growth of attention logits can indicate when instability is likely to occur in larger models. The authors also point out that the default epsilon hyperparameter of the AdamW optimizer may be too large, which makes the updates too small; they relate this issue, along with attention logit growth, to parameter norm growth.
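The summary above does not give a formula for LR sensitivity, so the following is only a minimal sketch of one plausible form: the average excess loss over a sweep of learning rates, relative to the best loss in the sweep. The function name `lr_sensitivity`, the cap at the loss at initialization, and all numbers in the example sweep are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def lr_sensitivity(final_losses, loss_at_init):
    """Mean excess loss over a learning-rate sweep (illustrative definition).

    final_losses: final loss for each learning rate in the sweep;
                  NaN or inf marks a diverged run.
    loss_at_init: loss of the untrained model, used as a cap so that
                  diverged runs contribute a bounded penalty (assumption).
    """
    losses = np.asarray(final_losses, dtype=float)
    # Treat diverged runs as no better than the untrained model.
    losses = np.where(np.isfinite(losses), losses, loss_at_init)
    capped = np.minimum(losses, loss_at_init)
    # Average gap between each run and the best run in the sweep.
    return float(np.mean(capped - capped.min()))

# Hypothetical sweep, with learning rates spaced roughly evenly in log space.
learning_rates = [3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1]
stable_run = [3.10, 3.00, 2.95, 2.96, 3.00, 3.05, 3.12]
unstable_run = [3.10, 3.00, 2.95, 3.40, 5.00, float("nan"), float("nan")]

print(lr_sensitivity(stable_run, loss_at_init=10.0))
print(lr_sensitivity(unstable_run, loss_at_init=10.0))
```

Under this definition, a model whose loss is flat across the sweep scores near zero, while a model that diverges at high learning rates scores much higher, which matches the qualitative use of the term in the summary.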
In summary, the goal of the paper is to reproduce and study instability in small-scale models, providing new scientific opportunities for studying training stability without requiring a large amount of resources.
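The two mitigations named above, qk-layernorm and z-loss, can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the layer norm here has no learnable parameters, the function names are hypothetical, and the z-loss coefficient `z_coeff=1e-4` is an assumed default.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector to zero mean and (approximately) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention_logits(q, k, use_qk_layernorm):
    # q, k: [seq_len, head_dim]. With qk-layernorm, queries and keys are
    # normalized before the dot product, which bounds the logit magnitude.
    if use_qk_layernorm:
        q, k = layer_norm(q), layer_norm(k)
    return q @ k.T / np.sqrt(q.shape[-1])

def cross_entropy_with_z_loss(logits, labels, z_coeff=1e-4):
    # logits: [batch, vocab], labels: [batch] of integer class ids.
    # log_z is the log of the softmax normalizer, computed stably.
    m = logits.max(axis=-1, keepdims=True)
    log_z = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True)))[:, 0]
    log_probs = logits - log_z[:, None]
    ce = -log_probs[np.arange(len(labels)), labels]
    # The auxiliary term penalizes log_z drifting away from zero.
    return float(np.mean(ce + z_coeff * log_z ** 2))

rng = np.random.default_rng(0)
q = 50.0 * rng.normal(size=(4, 16))  # deliberately large queries and keys
k = 50.0 * rng.normal(size=(4, 16))
print(np.abs(attention_logits(q, k, use_qk_layernorm=False)).max())
print(np.abs(attention_logits(q, k, use_qk_layernorm=True)).max())
```

With normalized queries and keys, the scaled logits in this sketch cannot exceed the square root of the head dimension in magnitude, no matter how large the unnormalized activations grow. The z-loss term leaves the predicted probabilities unchanged but discourages the output logits from drifting by a constant offset, since such a shift changes `log_z` without changing the cross-entropy.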