A Theory on Adam Instability in Large-Scale Machine Learning

Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, Binh Tang, Diana Liskovich, Puxin Xu, Yuchen Zhang, Melanie Kambadur, Stephen Roller, Susan Zhang
2023-04-25
Abstract: We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, leading to divergence. This artifact is more likely to be observed in the training of a deep model with a large batch size, which is the typical setting of large-scale language model training. To argue the theory, we present observations from the training runs of the language models of different scales: 7 billion, 30 billion, 65 billion, and 546 billion parameters.
Machine Learning, Artificial Intelligence, Optimization and Control
What problem does this paper attempt to address?
The paper examines an instability encountered when training large language models: a sudden surge in the training loss, referred to as a "loss spike". The authors argue that this instability is an artifact of the widely used Adam optimizer. They observe these spikes when training language models with billions to hundreds of billions of parameters. Earlier work mitigated the issue by restarting training from a checkpoint taken before the loss began to spike and skipping the problematic data batches. This paper instead aims to identify the root cause of the phenomenon.

The main contributions of the paper include:

1. **Theoretical Explanation**: Proposes a theory of why the Adam optimizer can lead to training instability under certain conditions, especially with large batch sizes.
2. **Experimental Validation**: Supports the theory with observations from training runs of models at different scales (from 7 billion to 546 billion parameters) and shows how loss spikes manifest during training.
3. **Analysis of Adam Algorithm Characteristics**: Analyzes characteristics of the Adam algorithm, in particular the assumption that gradient estimates are independent across time steps, and points out the potential consequences when this assumption is violated.
4. **Mechanism of Instability**: Explains in detail how the model state changes during a loss spike and why certain layers are particularly susceptible to this instability.

In summary, the paper attempts to reveal the root cause of training instability induced by the Adam optimizer in large-scale language model training through theoretical and empirical research, offering new perspectives for understanding and addressing this issue.
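
For readers unfamiliar with the optimizer under discussion, the following is a minimal sketch of the standard Adam moment estimates for a single scalar parameter, written in plain Python. It is not taken from the paper and does not reproduce its analysis or experiments; the function name `adam_update_ratios`, the toy gradient sequences, and the hyperparameter values are assumptions chosen for illustration. It only shows one well-known property: the size of the Adam step (before multiplying by the learning rate) depends on how gradients relate to each other across time steps, not just on their magnitude.

```python
import math


def adam_update_ratios(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected ratio m_hat / (sqrt(v_hat) + eps) per step.

    `grads` is a sequence of scalar gradients for one parameter. The actual
    parameter update would be -learning_rate * ratio. Hypothetical helper,
    written for illustration only.
    """
    m = 0.0  # exponential moving average of gradients (first moment)
    v = 0.0  # exponential moving average of squared gradients (second moment)
    ratios = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction, first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias correction, second moment
        ratios.append(m_hat / (math.sqrt(v_hat) + eps))
    return ratios


# Toy inputs (assumptions): same gradient magnitude, different time structure.
g = 0.01
steps = 100
constant = adam_update_ratios([g] * steps)
alternating = adam_update_ratios([g if t % 2 == 0 else -g for t in range(steps)])

print("constant-sign gradients, last ratio:", constant[-1])
print("alternating-sign gradients, last ratio:", alternating[-1])
```

With gradients that keep the same sign, the ratio stays close to 1 in magnitude; with gradients that flip sign every step, the first-moment average largely cancels and the ratio is much smaller. This is only a scalar toy case, but it may help build intuition for why the temporal structure of gradient estimates matters to the behavior of Adam.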