Abstract: We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about $10^{28}$ FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about $10^{31}$ FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
### What problems does this paper attempt to solve?
This paper examines how data movement and latency bottlenecks limit the scale of distributed neural network training under current algorithms, GPUs, and interconnect technologies. It addresses two key questions:
1. **Q1: What is the maximum training scale achievable with current technology within a fixed time?**
- Given current algorithms, GPUs, and interconnects, at what amount of training compute does data movement begin to significantly reduce hardware utilization, and at what amount does training become infeasible within a fixed time?
2. **Q2: How far can this limit be extended?**
- Which algorithmic or hardware advances could achieve this extension?
### Main Findings
- **A1: Maximum Training Scale under Current Technology**
- With existing technology and a three-month training duration, GPU utilization begins to decline once training compute exceeds approximately \(10^{28}\) floating-point operations (FLOP). At the recent growth rate of roughly 4.2x per year, this threshold would be reached in about three years.
- **A2: Potential of Hardware and Algorithm Improvements**
- Even with improved hardware interconnects, the training scale can increase by only about two orders of magnitude (to approximately \(10^{30}\) FLOP). Beyond that, latency imposes a hard limit at approximately \(10^{31}\) FLOP. Scaling further depends on innovation in machine learning algorithms, especially on turning the serial dependencies between batches and between layers into opportunities for parallelism.
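The "about three years" figure in A1 follows from simple compounding, which the minimal sketch below checks. It assumes a baseline of \(10^{26}\) FLOP for the largest training run to date (inferred from the abstract's "two orders of magnitude" gap) and that the 4.2x annual growth rate stays constant. `years_to_reach` is a hypothetical helper name, not something from the paper.

```python
import math

GROWTH_PER_YEAR = 4.2   # annual growth factor quoted in the summary
BASELINE_FLOP = 1e26    # assumption: two orders of magnitude below 1e28, per the abstract


def years_to_reach(target_flop, baseline_flop=BASELINE_FLOP, growth=GROWTH_PER_YEAR):
    """Years until training compute reaches target_flop at a constant growth factor."""
    return math.log(target_flop / baseline_flop) / math.log(growth)


# Thresholds mentioned in the summary: utilization decline, improved
# interconnect ceiling, and the latency-imposed hard limit.
for target in (1e28, 1e30, 1e31):
    print(f"{target:.0e} FLOP: {years_to_reach(target):.1f} years")
```

Under these assumptions the \(10^{28}\) FLOP threshold is a little over three years away, consistent with the summary's estimate. The figures for \(10^{30}\) and \(10^{31}\) FLOP are extrapolations of the same trend, not claims made by the paper.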
### Paper Structure
- **Section 2**: Introduces a simplified neural network model, consisting of stacked sparse linear multi-layer perceptron (MLP) blocks, as the basis for the subsequent analysis.
- **Section 3**: Outlines the four main parallelism strategies used in distributed training (data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism) and summarizes their communication costs.
- **Section 4**: Identifies the key constraints in distributed training, including data movement, critical batch size, latency, and model depth.
- **Section 5**: Derives a closed-form expression for the maximum training scale under this model.
- **Section 6**: Presents the complete theoretical model, which accounts for all identified constraints, and discusses simulation results based on current hardware that show the limits of efficient scaling.
### Conclusion
Through theoretical analysis and simulation, the paper identifies the data movement and latency bottlenecks that future large-scale training runs are likely to face and points to possible directions for improvement. These findings help clarify the limits of future deep learning model training.
### Key Formulas
- **Total Arithmetic Cost of MLP Block**:
\[
\text{MLP block's total arithmetic cost} = 2d_{\text{model}} d_{\text{ff}} b / E
\]
- **Number of MAC Operations for the Entire Model**:
\[
\text{Total MAC for model } F = 6L d_{\text{model}} d_{\text{ff}} b
\]
- **Total Amount of Data Movement across GPUs (Sum)**:
\[
\text{Total inter-GPU data movement} = 2\,[IJ(N_K - 1) + KJ(N_I - 1) + IK(N_J - 1)] \text{ words}
\]
These formulas help quantify the computational and communication costs under different parallelism strategies, providing a theoretical basis for understanding the bottlenecks in large-scale training.
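As a rough illustration, the three formulas can be evaluated directly. The sketch below is not taken from the paper: the function names are hypothetical, and the readings of the symbols are assumptions, since this summary does not define them. It assumes `L` is the number of layers, `b` the batch size, `E` the sparsity factor, `I`, `J`, `K` the dimensions of a partitioned matrix multiplication, and `N_I`, `N_J`, `N_K` the degrees of parallelism along each dimension.

```python
def mlp_block_arithmetic_cost(d_model, d_ff, b, E):
    """Arithmetic cost of one MLP block: 2 * d_model * d_ff * b / E."""
    return 2 * d_model * d_ff * b / E


def total_model_macs(L, d_model, d_ff, b):
    """Total MAC count F for the whole model: 6 * L * d_model * d_ff * b."""
    return 6 * L * d_model * d_ff * b


def inter_gpu_words(I, J, K, N_I, N_J, N_K):
    """Total inter-GPU data movement, in words, summed over all GPUs."""
    return 2 * (I * J * (N_K - 1) + K * J * (N_I - 1) + I * K * (N_J - 1))


# With no parallelism along any dimension, every (N - 1) factor is zero,
# so the formula predicts no inter-GPU traffic at all.
print(inter_gpu_words(4096, 4096, 4096, 1, 1, 1))

# Splitting along one dimension only adds the traffic term for that dimension.
print(inter_gpu_words(4096, 4096, 4096, 1, 1, 8))
```

Each term of the data movement formula vanishes when the corresponding degree of parallelism is 1, which makes this a convenient sanity check on any chosen partitioning.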