Scaling Law for Language Models Training Considering Batch Size

Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren
2024-12-02
Abstract: Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, which provides guidance for optimizing LLM training strategies under specific resource constraints.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper investigates how the global batch size affects the training of large language models (LLMs), in particular the convergence and generalization of the resulting models. Specifically, the researchers aim to establish empirically how the batch size relates to model size and the amount of training data, and to derive guidelines for optimizing LLM training strategies under different resource constraints.

### Research Background

In recent years, LLMs have made remarkable progress in natural-language understanding, generation, and reasoning, and scaling laws have played a crucial role in this progress. However, training an LLM requires huge computational resources, so a full-scale training run is usually a one-shot process guided largely by experience. With smaller models, by contrast, key hyperparameters such as batch size and learning rate can be explored comprehensively.

### Research Objectives

To address this challenge, previous work has established scaling laws on small models to guide cost-effective LLM training. However, these works have mainly focused on relatively small batch sizes. As training data and distributed computing systems grow rapidly, the batch size must be increased to use parallel computing resources efficiently and to maintain a high MFU (model FLOPs utilization). Some studies have shown that large-batch training may lead to a generalization gap and hurt final performance. Other studies have observed a complex relationship between batch size, model size, training budget, and final accuracy. This paper therefore systematically explores the impact of batch size on LLM training.

### Main Contributions

1. **Establishing the Scaling Law Benchmark**: The researchers curated a dataset containing up to 300 billion high-quality tokens and trained GPT-series models with 125 million to 2.6 billion parameters to obtain the basic scaling laws for model size \( N \) and amount of training data \( D \).
2. **Exploring the Impact of Large Batch Sizes**: The batch size was extended to up to 32 million tokens to study how such large batches affect training convergence and generalization.
3. **The Impact of Learning Rate**: Because the learning rate (LR) is closely related to the batch size, three typical learning rate schemes were run for each batch size to investigate their combined effect and to study the relationship between the optimal learning rate and the batch size.
4. **Extending the Basic Scaling Laws**: The basic scaling laws were extended by introducing batch size as an additional factor. The study found that the optimal batch size can be expressed as a function of the compute budget \( C \) when \( (N, D) \) lies on the compute-efficient frontier, or as a function of \( D \) when \( (N, D) \) does not necessarily lie on that frontier.
5. **Verification and Application**: Extrapolation experiments on 430-million- and 700-million-parameter models verified the practical validity of these laws. The study provides detailed guidelines for optimizing LLM training strategies under different resource constraints.

### Conclusions

Through empirical research, the researchers have characterized the complex relationships between batch size, model size, and the amount of training data, and have proposed guidelines for optimizing LLM training strategies under specific resource constraints.
These findings not only help in understanding the key parameters in large-model training but also provide theoretical support for resource allocation in practical applications.
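
The summary above says only that the optimal batch size "can be expressed as a function of" the compute budget \( C \) (or of \( D \)); it does not give the functional form. As a rough illustration of how such a law might be fitted and extrapolated, the sketch below assumes a power law \( B_{opt} = a \cdot C^{b} \), which is the form scaling laws are commonly fitted with but is an assumption here, not the paper's stated result. The function names, the coefficients, and all data points are hypothetical and synthetic; none of them come from the paper.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**b by ordinary least squares in log-log space.

    Returns the tuple (a, b). The power-law form is an assumption made
    for illustration, not a result taken from the paper.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    log_y = np.log(np.asarray(y, dtype=float))
    # np.polyfit returns coefficients from highest degree to lowest,
    # so for degree 1 the result is (slope, intercept).
    b, log_a = np.polyfit(log_x, log_y, deg=1)
    return float(np.exp(log_a)), float(b)


def predict_power_law(a, b, x):
    """Evaluate the fitted law a * x**b at x."""
    return a * np.asarray(x, dtype=float) ** b


# Synthetic data, generated from hypothetical coefficients (NOT from the paper):
# compute budgets in FLOPs and the batch size (in tokens) that would be
# loss-optimal at each budget under the assumed law.
compute_flops = np.array([1e18, 1e19, 1e20, 1e21])
hypothetical_a, hypothetical_b = 2.0e-3, 0.5
best_batch_tokens = hypothetical_a * compute_flops ** hypothetical_b

# Fit on the small budgets, then extrapolate to a larger one.
a, b = fit_power_law(compute_flops, best_batch_tokens)
extrapolated = float(predict_power_law(a, b, 1e22))
print(f"fitted a={a:.3e}, b={b:.3f}, predicted batch at 1e22 FLOPs: {extrapolated:.3e} tokens")
```

In practice the "best" batch size at each budget would come from a sweep over batch sizes and learning rates on small models, and the fitted curve would then be checked against larger runs, as the extrapolation experiments in the summary do. The same routine applies to the fixed-data case by passing token counts \( D \) in place of compute budgets.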