Abstract: The number of parameters in Transformer-based large language models keeps increasing, and the scale of computing clusters is growing with it. Techniques for quickly mobilizing large amounts of computing resources for parallel computing are therefore becoming increasingly important. In this paper, we propose an automatic parallel algorithm that plans the parallel strategy with maximum throughput based on model and hardware information. By decoupling the training time into computation, communication, and overlap, we establish a training duration simulation model. Based on this simulation model, we prune the parallel solution space to shorten the required search time. Multi-node experiment results show that the algorithm can estimate the parallel training duration in real time with an average accuracy of 96%. In our tests, the strategy recommended by the algorithm was always globally optimal.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to automatically plan the optimal parallel strategy that maximizes throughput when training large language models (LLMs). As the number of parameters in Transformer-based large language models increases, the scale of computing clusters is also expanding, and techniques for quickly mobilizing large amounts of computing resources for parallel computing are becoming more and more important.

Specifically, the paper proposes an automatic parallel algorithm that plans the parallel strategy with maximum throughput according to model and hardware information. By decomposing the training time into computation, communication, and overlap, the authors establish a training duration simulation model and, based on this model, prune the parallel solution space to shorten the time required for the search. Experimental results show that the algorithm can estimate the parallel training duration in real time in a multi-node environment with an average accuracy of 96%, and that the recommended strategy was always globally optimal in the authors' tests.

### Key issues:

1. **Parallel strategy optimization**: Although traditional distributed training frameworks can handle large-scale models, they offer no guidance on choosing the hyperparameters introduced by multiple parallel strategies. Users therefore struggle to select appropriate values, which increases the time and cost of preliminary experiments.
2. **High complexity**: Due to the complexity of the parallel framework, quickly and accurately finding the globally optimal parallel strategy remains a challenge.
3. **Hyperparameter selection**: Existing methods usually assume that certain hyperparameters (such as the global batch size and the micro-batch size) are constants, ignoring the impact of these variables on training efficiency.

### Solutions:

- **Training duration simulation**: The parallel training duration is modeled as three parts (computation, communication, and overlap), and the training time is estimated through operator-level analysis and modeling, with an average estimation accuracy of 96%.
- **Pruning and searching**: Based on the simulation model, a pruning strategy is proposed that removes 99% of the search space, so that only the most effective parallel strategies need to be enumerated.
- **Comprehensive consideration of hyperparameters**: Unlike previous work, the algorithm covers a wider range of parallel hyperparameters, including the global batch size and the micro-batch size.

Through these methods, the paper aims to provide an efficient and accurate automatic parallel planning algorithm that helps users select the optimal parallel strategy for large language model training, thereby improving training efficiency and reducing costs.
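
To make the idea of "simulate, prune, then search" concrete, here is a minimal Python sketch. It is not the paper's algorithm: the cost formulas, the constants, the memory-based pruning rule, and all names (`Strategy`, `step_time`, `memory_gb`, `plan`) are illustrative assumptions. It only shows the general shape of estimating step time as computation plus communication minus overlap, discarding infeasible configurations before evaluating them, and picking the configuration with the highest estimated throughput.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Strategy:
    tp: int            # tensor-parallel degree
    pp: int            # pipeline-parallel degree
    dp: int            # data-parallel degree
    micro_batch: int   # samples per micro-batch
    global_batch: int  # samples per optimizer step


def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def step_time(s, compute_per_sample=1.0, comm_unit=0.2, overlap_ratio=0.5):
    """Toy step-time estimate: computation + communication - overlap.

    All constants are made-up placeholders, not measured values.
    """
    num_micro = s.global_batch // (s.dp * s.micro_batch)
    # Time one pipeline stage spends on one micro-batch.
    per_stage = compute_per_sample * s.micro_batch / (s.tp * s.pp)
    # (num_micro + pp - 1) stage slots accounts for the pipeline bubble.
    compute = per_stage * (num_micro + s.pp - 1)
    # Assumed communication terms for tensor and data parallelism.
    tp_comm = comm_unit * (s.tp - 1) / s.tp * num_micro * s.micro_batch
    dp_comm = comm_unit * (s.dp - 1) / s.dp
    comm = tp_comm + dp_comm
    # Assume a fixed fraction of the shorter phase is hidden by overlap.
    overlap = overlap_ratio * min(compute, comm)
    return compute + comm - overlap


def memory_gb(s, param_gb=40.0, act_gb_per_sample=2.0):
    """Toy per-device memory estimate, used only for pruning."""
    return param_gb / (s.tp * s.pp) + act_gb_per_sample * s.micro_batch / s.tp


def plan(num_gpus, global_batches, mem_limit_gb):
    """Return (best_strategy, throughput, evaluated, pruned)."""
    best, best_tput, evaluated, pruned = None, 0.0, 0, 0
    for tp, pp in product(divisors(num_gpus), repeat=2):
        if num_gpus % (tp * pp):
            continue  # tp * pp * dp must equal the device count
        dp = num_gpus // (tp * pp)
        for gb in global_batches:
            if gb % dp:
                continue  # global batch must split evenly across replicas
            for mb in divisors(gb // dp):
                s = Strategy(tp, pp, dp, mb, gb)
                if memory_gb(s) > mem_limit_gb:
                    pruned += 1  # skip without estimating its step time
                    continue
                evaluated += 1
                tput = gb / step_time(s)  # samples per unit time
                if tput > best_tput:
                    best, best_tput = s, tput
    return best, best_tput, evaluated, pruned


best, tput, evaluated, pruned = plan(8, [32, 64], 24.0)
print(best, round(tput, 2), evaluated, pruned)
```

Treating the global batch size and micro-batch size as search variables, rather than constants, mirrors the third point above. A real planner would replace the toy formulas with operator-level measurements and would use stronger pruning rules than a single memory check.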

Automatically Planning Optimal Parallel Strategy for Large Language Models
