Abstract: We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, leading to divergence. This artifact is more likely to be observed in the training of a deep model with a large batch size, which is the typical setting of large-scale language model training. To support the theory, we present observations from training runs of language models of different scales: 7 billion, 30 billion, 65 billion, and 546 billion parameters.
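
To make the two quantities in the abstract concrete (the norm of the update vector and its correlation with the descent direction), here is a minimal NumPy sketch of the standard Adam update with a small diagnostic. The function names and the use of the negative mini-batch gradient as a stand-in for the descent direction are assumptions made for illustration; the abstract does not say how the authors measured these quantities.

```python
import numpy as np

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step; returns the parameter update and new moments."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v

def update_diagnostics(update, grad):
    """Illustrative helper (hypothetical): update norm and its cosine
    similarity with the negative gradient, used here as a proxy for the
    descent direction."""
    norm = np.linalg.norm(update)
    denom = norm * np.linalg.norm(grad)
    cos = float(np.dot(update, -grad) / denom) if denom > 0 else 0.0
    return float(norm), cos

grad = np.array([1.0, -2.0, 0.5])
update, m, v = adam_step(grad, np.zeros(3), np.zeros(3), t=1)
norm, cos = update_diagnostics(update, grad)
```

In a healthy step the cosine is clearly positive. The state described in the abstract would correspond to a comparatively large `norm` together with a `cos` near zero.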

What problem does this paper attempt to address?

The paper investigates an instability seen in large-scale machine learning: during the training of large language models, the training loss can suddenly surge (a "loss spike"). The authors attribute this instability to Adam, the widely used optimization algorithm. They observe such abnormal surges when training language models with billions to hundreds of billions of parameters. As a mitigation, previous researchers proposed restarting training from a checkpoint taken before the loss began to spike and skipping the problematic data batches. This paper instead aims to identify the root cause of the phenomenon. Its main contributions are:

1. **Theoretical Explanation**: Proposes a theory of why the Adam optimizer can cause training instability under certain conditions, especially with larger batch sizes.
2. **Experimental Validation**: Validates the theoretical predictions with models of different scales (from 7 billion to 546 billion parameters) and shows how loss surges manifest during training.
3. **Analysis of Adam Algorithm Characteristics**: Analyzes characteristics of the Adam algorithm, particularly its handling of the time-domain independence assumption on gradient estimates, and the potential consequences when this assumption is violated.
4. **Mechanism of Instability**: Explains in detail how the model state changes during a loss surge and why certain layers are particularly susceptible to this instability.

In summary, the paper attempts to reveal the root cause of training instability induced by the Adam optimizer in large-scale language model training through theoretical and empirical research, offering new perspectives for understanding and addressing this issue.
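
The restart-and-skip workaround that the summary attributes to earlier work can be sketched as a toy training loop. Everything here is hypothetical and chosen for illustration: `step_fn`, the `spike_factor` threshold, and the checkpoint interval are not taken from the paper, and the sketch skips only the single batch that triggered the spike.

```python
import copy

def train_with_spike_recovery(batches, step_fn, state,
                              spike_factor=2.0, checkpoint_every=2):
    """Toy loop: on a loss spike, roll back to the last checkpoint and
    skip the offending batch. step_fn(state, batch) -> (new_state, loss)."""
    checkpoint = (0, copy.deepcopy(state))  # (next batch index, saved state)
    skipped = set()
    prev_loss = None
    i = 0
    while i < len(batches):
        if i in skipped:                    # batch previously flagged
            i += 1
            continue
        new_state, loss = step_fn(state, batches[i])
        if prev_loss is not None and loss > spike_factor * prev_loss:
            skipped.add(i)                  # mark the batch as problematic
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            prev_loss = None                # no baseline right after rollback
            continue
        state, prev_loss = new_state, loss
        i += 1
        if i % checkpoint_every == 0:       # periodic checkpoint
            checkpoint = (i, copy.deepcopy(state))
    return state, sorted(skipped)

# Toy usage: the "state" counts completed steps and the loss is the batch value.
final_state, skipped = train_with_spike_recovery(
    [1.0, 1.0, 10.0, 1.0], lambda s, b: (s + 1, b), 0)
```

This treats the symptom only, which is the summary's point: the paper looks for the cause of the spike rather than a way to step around it.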

A Theory on Adam Instability in Large-Scale Machine Learning
