Abstract: We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about $10^{28}$ FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about $10^{31}$ FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.

### What problems does this paper attempt to solve?

This paper explores how far distributed neural network training can be scaled before data movement and latency bottlenecks take over, given existing algorithms, GPUs, and interconnect technologies. Specifically, it attempts to answer the following two key questions:

1. **Q1: What is the maximum training scale using current technology within a fixed time?**
   - Given current algorithms, GPUs, and interconnect technologies, at what amount of training computation does data movement start to significantly reduce hardware utilization, or even make the run infeasible, within a fixed time?
2. **Q2: How much can this limitation be extended?**
   - What algorithmic or hardware advancements can achieve this extension?

### Main Findings

- **A1: Maximum Training Scale under Current Technology**
  - With existing technology, GPU utilization begins to decline once the training computation exceeds approximately $10^{28}$ floating-point operations (FLOP). Based on the recent growth trend of 4.2 times per year, this scale will be reached in about three years.
- **A2: Potential of Hardware and Algorithm Improvements**
  - Even with improved hardware interconnect technologies, the training scale can only increase by two orders of magnitude (reaching approximately $10^{30}$ FLOP). Beyond that, latency imposes a hard limit on the training scale (approximately $10^{31}$ FLOP). To scale further, the key lies in innovation in machine-learning algorithms, especially in turning the serial dependencies between batches and between layers into opportunities for parallelism.

### Paper Structure

- **Section 2**: Introduces a simplified neural network model, consisting of stacked sparse linear multi-layer perceptron (MLP) blocks, as the basis for subsequent analysis.
- **Section 3**: Outlines the four main parallelism strategies used in distributed training (data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism) and summarizes their communication costs.
- **Section 4**: Identifies the key constraints in distributed training, including data movement, critical batch size, latency, and model depth.
- **Section 5**: Derives a closed-form expression for the maximum training scale under this model.
- **Section 6**: Presents the complete theoretical model, taking all identified constraints into account, and discusses simulation results based on current hardware, showing the limits of efficient scaling.

### Conclusion

Through theoretical analysis and simulation, the paper identifies the bottlenecks and technical challenges that future large-scale model training is likely to face and points out possible directions for improvement. These findings matter for understanding the limits of future deep-learning model training.

### Key Formulas

- **Total Arithmetic Cost of MLP Block**:
  \[ \text{MLP block's total arithmetic cost} = 2 d_{\text{model}} d_{\text{ff}} b / E \]
- **Number of MAC Operations for the Entire Model**:
  \[ \text{Total MAC for model } F = 6 L d_{\text{model}} d_{\text{ff}} b \]
- **Total Amount of Data Movement across GPUs (Sum)**:
  \[ \text{Total inter-GPU data movement} = 2[IJ(N_K - 1) + KJ(N_I - 1) + IK(N_J - 1)] \text{ words} \]

These formulas help quantify the computational and communication costs under different parallelism strategies, providing a theoretical basis for understanding the bottlenecks in large-scale training.
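
As a rough illustration, the key formulas and the "about three years" estimate can be evaluated with a few lines of Python. This is a minimal sketch, not the paper's code: the function names are hypothetical, and the reading of the symbols is an assumption made here (`b` as the batch size, `E` as the sparsity factor, `L` as the number of layers, `I`, `J`, `K` as matrix-multiplication dimensions, and `N_I`, `N_J`, `N_K` as the degree of parallelism along each dimension). The starting point of $10^{26}$ FLOP is inferred from the abstract's statement that $10^{28}$ FLOP is two orders of magnitude above the largest training run to date.

```python
import math


def mlp_block_cost(d_model: int, d_ff: int, b: int, E: int) -> float:
    """Total arithmetic cost of one MLP block: 2 * d_model * d_ff * b / E."""
    return 2 * d_model * d_ff * b / E


def total_model_mac(L: int, d_model: int, d_ff: int, b: int) -> int:
    """Total MAC count for the model: F = 6 * L * d_model * d_ff * b."""
    return 6 * L * d_model * d_ff * b


def inter_gpu_words(I: int, J: int, K: int, N_I: int, N_J: int, N_K: int) -> int:
    """Total inter-GPU data movement in words:
    2 * [I*J*(N_K - 1) + K*J*(N_I - 1) + I*K*(N_J - 1)].
    """
    return 2 * (I * J * (N_K - 1) + K * J * (N_I - 1) + I * K * (N_J - 1))


def years_to_reach(target_flop: float, current_flop: float,
                   growth_per_year: float = 4.2) -> float:
    """Years until compute grows from current_flop to target_flop,
    assuming a constant multiplicative growth rate per year."""
    return math.log(target_flop / current_flop) / math.log(growth_per_year)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to exercise the formulas.
    print(mlp_block_cost(d_model=8192, d_ff=32768, b=4096, E=1))
    print(total_model_mac(L=80, d_model=8192, d_ff=32768, b=4096))
    # Without any parallelism (N_I = N_J = N_K = 1) no data crosses GPUs.
    print(inter_gpu_words(I=4096, J=8192, K=32768, N_I=1, N_J=1, N_K=1))
    # Assumed starting point: 1e26 FLOP, two orders of magnitude below 1e28.
    print(years_to_reach(target_flop=1e28, current_flop=1e26))
```

With 4.2x annual growth, two orders of magnitude take a little over three years, which matches the summary's rounded figure. The data-movement expression vanishes when every parallelism degree is 1, since each of its three terms is proportional to one of `N_K - 1`, `N_I - 1`, or `N_J - 1`.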

Data movement limits to frontier model training

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Optimizing Distributed Training on Frontier for Large Language Models

Pretraining Billion-scale Geospatial Foundational Models on Frontier

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Reducing the Barriers to Entry for Foundation Model Training

A Look Into Training Large Language Models on Next Generation Datacenters

Decentralized Training of Foundation Models in Heterogeneous Environments

A Dynamical Model of Neural Scaling Laws

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

Secure Distributed Training at Scale

BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes

Is the Number of Trainable Parameters All That Actually Matters?

From promise to practice: realizing high-performance decentralized training

Distributed Training Large-Scale Deep Architectures

Time Matters: Scaling Laws for Any Budget

Is Network the Bottleneck of Distributed Training?

The Power of Training: How Different Neural Network Setups Influence the Energy Demand

Pipelined Backpropagation at Scale: Training Large Models without Batches

Accelerating Data Loading in Deep Neural Network Training

An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem