Abstract: Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size, concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control on factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression. Of independent interest, we highlight the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.

What problem does this paper attempt to address?

The key problem this paper addresses is how to optimize data parallelism when pre-training large-scale models so as to improve computational efficiency. Specifically, the authors focus on the scaling behavior of the "critical batch size" (CBS). Below the CBS, increasing the batch size can linearly reduce the number of optimization steps required to reach a target loss; beyond it, further increases no longer significantly improve the quality of the gradient estimate and yield diminishing returns.

### Main problems

1. **Understanding the scaling law of CBS**:
   - Study how CBS changes with model size and data size.
   - Isolate the effects of model size and data size on the growth of CBS, so that each can be understood independently.
2. **Optimizing the pre-training process**:
   - Systematically study the effects of hyper-parameters (such as batch size, momentum, and learning rate along with its schedule) to find the optimal training configuration.
   - Provide theoretical and empirical evidence to guide resource allocation and the choice of training strategy in large-scale pre-training.

### Method overview

- **Experimental design**: Pre-train a series of autoregressive language models (from 85M to 1.2B parameters) on the C4 dataset, controlling for confounding factors through extensive hyper-parameter sweeps.
- **Data analysis**: Fit scaling laws with respect to model size and data size to decouple their effects on CBS.
- **Theoretical analysis**: Use the theory of infinite-width limits of neural networks and high-dimensional linear regression to explain why CBS scales with data size.

### Key findings

1. **CBS depends mainly on data size rather than model size**:
   - With data size held fixed, CBS barely changes as model size changes.
   - When model size and data size are scaled up together, CBS increases accordingly.
2. **Scaling law**:
   - For a fixed model size, CBS increases with data size, following a power-law relationship.
   - This finding helps guide the efficient use of resources in large-scale pre-training, especially when the amount of data is large.
3. **Theoretical support**:
   - Infinite-width-limit theory is used to show that, for a fixed amount of data, CBS does not change as model width grows without bound.
   - For high-dimensional linear regression, the analysis explains why CBS increases with data size.

### Practical significance

These findings offer guidance for optimizing large-scale pre-training. In particular, when resources are limited, they can help researchers and engineers balance compute, data size, and training time more effectively.
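The power-law relationship between CBS and data size mentioned in the summary can be illustrated with a small fitting sketch. This is only a minimal example under stated assumptions: the functional form `cbs ≈ coeff * data_size ** exponent`, the helper name `fit_power_law`, and all numbers are hypothetical and synthetic. They are not the paper's fitting procedure, measurements, or reported exponents.

```python
import numpy as np


def fit_power_law(data_sizes, cbs_values):
    """Fit cbs ≈ coeff * data_size**exponent by least squares in log-log space.

    Hypothetical helper for illustration; not the paper's fitting procedure.
    """
    log_d = np.log(np.asarray(data_sizes, dtype=float))
    log_b = np.log(np.asarray(cbs_values, dtype=float))
    # A power law is a straight line in log-log space:
    # log(cbs) = exponent * log(data_size) + log(coeff)
    exponent, log_coeff = np.polyfit(log_d, log_b, deg=1)
    return float(np.exp(log_coeff)), float(exponent)


# Synthetic, noise-free data generated from an assumed power law.
# These values are made up for the example and are NOT results from the paper.
tokens = np.array([1e9, 2e9, 4e9, 8e9, 16e9])
assumed_coeff, assumed_exponent = 0.01, 0.5
cbs = assumed_coeff * tokens ** assumed_exponent

coeff, exponent = fit_power_law(tokens, cbs)
print(f"fitted exponent: {exponent:.3f}, fitted coefficient: {coeff:.4f}")
```

Fitting in log-log space keeps the example to a single linear regression. With real, noisy CBS measurements one would likely also want uncertainty estimates on the exponent, which this sketch omits.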

How Does Critical Batch Size Scale in Pre-training?

Scaling Law for Language Models Training Considering Batch Size

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Large Batch Training of Convolutional Networks

Exploring the Limits of Large Scale Pre-training

Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance

A Variable Batch Size Strategy for Large Scale Distributed DNN Training

Scaling Laws for Neural Language Models

Channel and filter parallelism for large-scale CNN training

Fast and accurate variable batch size convolution neural network training on large scale distributed systems

Unified Neural Network Scaling Laws and Scale-time Equivalence

Scaling Laws for Pre-training Agents and World Models

Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

Pipelined Backpropagation at Scale: Training Large Models without Batches

The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning

A Solvable Model of Neural Scaling Laws

Explaining Neural Scaling Laws

Unraveling the Mystery of Scaling Laws: Part I

The Effect of Network Width on the Performance of Large-batch Training

The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis

A Dynamical Model of Neural Scaling Laws