Spike No More: Stabilizing the Pre-training of Large Language Models

Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
2024-10-10
Abstract: Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. Based on the assumption that the loss spike is caused by the sudden growth of the gradient norm, we explore factors to keep the gradient norm small through an analysis of the spectral norms of the Jacobian matrices for the sub-layers. Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and large shortcut. We conduct various experiments to empirically verify our theoretical analyses. Experimental results demonstrate that methods satisfying the conditions effectively prevent loss spikes during pre-training.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses loss spikes, which frequently occur during the pre-training of large language models (LLMs). Such a sudden increase in the loss degrades model performance and can sometimes ruin the entire pre-training run. Because pre-training consumes a vast computational budget, avoiding these spikes is particularly important. Starting from the assumption that a loss spike is caused by sudden growth of the gradient norm, the authors analyze the spectral norms of the Jacobian matrices of the Transformer sub-layers to identify what keeps the gradient norm small. The analysis suggests that two conditions stabilize pre-training: **small sub-layers** (that is, initializing sub-layer parameters with small values) and **large shortcut connections** (that is, making the standard deviation of each embedding close to 1). The authors verify this analysis through a series of experiments and show that methods satisfying both conditions effectively prevent loss and gradient spikes during pre-training. These methods also allow LLMs to be pre-trained with a relatively high learning rate, which leads to better performance.
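The two conditions can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the sizes, the initialization standard deviation of 0.02, the linear stand-in for a sub-layer, and the helper names `per_embedding_std` and `normalize_embeddings` are all assumptions chosen for illustration. The sketch shows that a small initialization leaves each embedding with a standard deviation far below 1, that a simple per-embedding normalization brings it close to 1, and that for a linear sub-layer with a residual connection the spectral norm of the Jacobian is bounded by 1 plus the spectral norm of the sub-layer weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and initialization scale, chosen only for illustration.
d_model, vocab_size = 512, 1000
init_std = 0.02

# Small initialization for both the embedding table and a toy linear sub-layer.
embed = rng.normal(0.0, init_std, size=(vocab_size, d_model))
w_sub = rng.normal(0.0, init_std, size=(d_model, d_model))


def per_embedding_std(e):
    """Standard deviation of each embedding vector (one value per row)."""
    return e.std(axis=1)


def normalize_embeddings(e, eps=1e-6):
    """Shift and rescale each embedding to zero mean and roughly unit std.

    This is one simple way to make the std of each embedding close to 1;
    it is an assumption, not necessarily the method used in the paper.
    """
    mean = e.mean(axis=1, keepdims=True)
    std = e.std(axis=1, keepdims=True)
    return (e - mean) / (std + eps)


# "Large shortcut": compare the embedding scale before and after normalization.
raw_std = per_embedding_std(embed).mean()
norm_std = per_embedding_std(normalize_embeddings(embed)).mean()

# "Small sub-layers": for the toy sub-layer f(x) = w_sub @ x, the Jacobian is
# w_sub itself, so its spectral norm is the largest singular value of w_sub.
sub_norm = np.linalg.norm(w_sub, 2)

# With a residual connection y = x + f(x), the Jacobian is I + w_sub, and the
# triangle inequality bounds its spectral norm by 1 + sub_norm.
residual_norm = np.linalg.norm(np.eye(d_model) + w_sub, 2)

print(f"mean std per embedding (raw):        {raw_std:.4f}")
print(f"mean std per embedding (normalized): {norm_std:.4f}")
print(f"spectral norm of sub-layer Jacobian: {sub_norm:.4f}")
print(f"spectral norm with residual:         {residual_norm:.4f}")
```

In this toy setting, scaling `init_std` up scales `sub_norm` by the same factor, which is why a smaller initialization keeps the bound on the residual Jacobian tighter.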