Automatically Planning Optimal Parallel Strategy for Large Language Models

Zongbiao Li, Xiezhao Li, Yinghao Cui, Yijun Chen, Zhixuan Gu, Yuxuan Liu, Wenbo Zhu, Fei Jia, Ke Liu, Qifeng Li, Junyao Zhan, Jiangtao Zhou, Chenxi Zhang, Qike Liu
2024-12-31
Abstract: The number of parameters in large-scale language models based on transformers is gradually increasing, and the scale of computing clusters is also growing. The technology of quickly mobilizing large amounts of computing resources for parallel computing is becoming increasingly important. In this paper, we propose an automatic parallel algorithm that automatically plans the parallel strategy with maximum throughput based on model and hardware information. By decoupling the training time into computation, communication, and overlap, we established a training duration simulation model. Based on this simulation model, we prune the parallel solution space to shorten the search time required. The multi-node experiment results show that the algorithm can estimate the parallel training duration in real time with an average accuracy of 96%. In our test, the recommendation strategy provided by the algorithm is always globally optimal.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to automatically plan the parallel strategy that maximizes throughput when training large language models (LLMs). As the parameter counts of Transformer-based LLMs and the size of computing clusters both grow, the ability to quickly mobilize large amounts of computing resources for parallel training becomes increasingly important.

Specifically, the paper proposes an automatic parallel algorithm that plans the maximum-throughput parallel strategy from model and hardware information. The authors decompose training time into computation, communication, and overlap to build a training duration simulation model, and then use that model to prune the parallel solution space and shorten the search. In multi-node experiments, the algorithm estimates parallel training duration in real time with an average accuracy of 96%, and in the authors' tests the recommended strategy was always globally optimal.

### Key issues:

1. **Parallel strategy optimization**: Traditional distributed training frameworks can handle large-scale models, but they offer no guidance on choosing the hyper-parameters that multiple parallel strategies introduce. Users struggle to pick suitable values, which increases the time and cost of preliminary experiments.
2. **High complexity**: Because parallel frameworks are complex, finding the globally optimal parallel strategy quickly and accurately remains a challenge.
3. **Hyper-parameter selection**: Existing methods usually treat certain hyper-parameters, such as the global batch size and the micro-batch size, as constants, ignoring their impact on training efficiency.

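
To illustrate why manual selection is costly, the following minimal sketch counts the arithmetically valid configurations on a hypothetical 64-GPU cluster, first with fixed batch sizes and then with batch sizes treated as variables. The names `tp`, `pp`, and `dp` (tensor-, pipeline-, and data-parallel degrees), the function names, and the candidate values are illustrative assumptions; the summary does not specify the paper's exact search space.

```python
from itertools import product


def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_strategies(num_gpus, global_batches, micro_batches):
    """Yield every (tp, pp, dp, global_batch, micro_batch) tuple that is
    arithmetically valid: tp * pp * dp == num_gpus, and the global batch
    splits evenly into micro-batches across the data-parallel replicas."""
    for tp, pp in product(divisors(num_gpus), repeat=2):
        if num_gpus % (tp * pp) != 0:
            continue
        dp = num_gpus // (tp * pp)
        for gbs, mbs in product(global_batches, micro_batches):
            if gbs % (dp * mbs) == 0:
                yield (tp, pp, dp, gbs, mbs)


# Hypothetical cluster and candidate batch sizes (assumed values).
fixed = list(enumerate_strategies(64, [512], [1]))
variable = list(enumerate_strategies(64, [256, 512, 1024], [1, 2, 4, 8]))
print(len(fixed), len(variable))
```

Even in this toy setting, letting the batch sizes vary multiplies the number of candidates, and real frameworks add further hyper-parameters on top of these five.
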
### Solutions:

- **Training duration simulation**: The parallel training duration is modeled as three parts: computation, communication, and overlap. Training time is estimated through operator-level analysis and modeling, with an average estimation accuracy of 96%.
- **Pruning and searching**: Based on the simulation model, the authors propose a pruning strategy that removes 99% of the search space, so that only the most promising parallel strategies need to be enumerated.
- **Comprehensive consideration of hyper-parameters**: Unlike previous work, the algorithm covers a wider range of parallel hyper-parameters, including the global batch size and the micro-batch size.

With these methods, the paper aims to provide an efficient and accurate automatic parallel planning algorithm that helps users select the optimal parallel strategy for large language model training, improving training efficiency and reducing cost.
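
The simulate-then-prune idea can be sketched as follows. This is a toy illustration, not the paper's operator-level model: the memory formula, the `(num_micro + pp - 1)` pipeline-slot idealization, the `overlap_ratio` parameter, and every numeric constant are assumptions introduced here. Only the decomposition of step time into computation, communication, and overlap comes from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    tp: int           # tensor-parallel degree
    pp: int           # pipeline-parallel degree
    dp: int           # data-parallel degree
    micro_batch: int  # samples per micro-batch


def step_time(compute_s, comm_s, overlap_s):
    """Step time = computation + communication - overlap.
    The overlap cannot exceed either component, so it is clamped."""
    overlap_s = max(0.0, min(overlap_s, compute_s, comm_s))
    return compute_s + comm_s - overlap_s


def fits_in_memory(s, param_bytes, act_bytes_per_sample, gpu_mem_bytes):
    """Assumed toy memory check used for pruning: parameters are sharded
    over tp * pp, activations over tp."""
    per_gpu = (param_bytes / (s.tp * s.pp)
               + act_bytes_per_sample * s.micro_batch / s.tp)
    return per_gpu <= gpu_mem_bytes


def simulate(s, global_batch, sample_compute_s, tp_comm_s, dp_comm_s,
             overlap_ratio):
    """Assumed toy estimate of one training step, in seconds."""
    num_micro = global_batch // (s.dp * s.micro_batch)
    stage_s = sample_compute_s * s.micro_batch / (s.tp * s.pp)
    # Idealized pipeline schedule: micro-batches plus bubble slots.
    compute_s = (num_micro + s.pp - 1) * stage_s
    comm_s = ((num_micro * tp_comm_s if s.tp > 1 else 0.0)
              + (dp_comm_s if s.dp > 1 else 0.0))
    return step_time(compute_s, comm_s, overlap_ratio * comm_s)


def best_strategy(candidates, global_batch, memory, timing):
    """Prune infeasible candidates, simulate the rest, and return the
    (strategy, seconds) pair with the shortest step, or None."""
    feasible = [s for s in candidates
                if global_batch % (s.dp * s.micro_batch) == 0
                and fits_in_memory(s, **memory)]
    if not feasible:
        return None
    timed = [(s, simulate(s, global_batch, **timing)) for s in feasible]
    return min(timed, key=lambda pair: pair[1])


# Hypothetical 8-GPU setup; all numbers are made up for illustration.
candidates = [Strategy(tp, pp, 8 // (tp * pp), mb)
              for tp in (1, 2, 4, 8) for pp in (1, 2, 4, 8)
              if tp * pp <= 8
              for mb in (1, 2, 4)]
memory = dict(param_bytes=40e9, act_bytes_per_sample=4e9, gpu_mem_bytes=16e9)
timing = dict(sample_compute_s=0.8, tp_comm_s=0.01, dp_comm_s=0.2,
              overlap_ratio=0.5)
best, best_seconds = best_strategy(candidates, 64, memory, timing)
print(best, best_seconds)
```

For a fixed global batch size, the shortest simulated step is also the highest throughput. Pruning before simulation matters because the feasibility check is far cheaper than the time estimate; the memory check here stands in for whatever pruning rules the paper actually uses.
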