Abstract: Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa

What problem does this paper attempt to address?

The problem this paper addresses is how to automatically generate efficient model-parallel training plans for large-scale distributed deep learning. Existing model-parallel training systems either require users to create parallelization plans by hand or can automatically generate a plan only from a limited space of model-parallelism configurations. Neither approach suffices to scale out the training of complex deep learning models on distributed compute devices. Alpa addresses this by organizing parallelism into two hierarchical levels, intra-operator parallelism and inter-operator parallelism, and automatically deriving an execution plan at each level.

### Main Contributions

1. **Constructed a two-level space of parallel execution plans**: The space is built hierarchically from intra-operator and inter-operator parallelism, and the plan at each level can be optimized independently.
2. **Designed effective optimization algorithms**: Developed tractable optimization algorithms that generate near-optimal execution plans at each level.
3. **Implemented the Alpa compiler system**: Alpa is a compiler system for distributed deep learning on GPU clusters, which includes:
   - A set of compilation passes that use the hierarchical optimization algorithms to generate execution plans.
   - A new runtime architecture that orchestrates inter-operator parallelism between device meshes.
   - A number of system optimizations that improve compilation and cross-mesh communication.

### Technical Details

- **Intra-operator Parallelism**: Partition an operator along one or more tensor axes and assign the partitions to distributed devices. This involves choosing an appropriate parallel algorithm for each operator and changing the layout of its input tensors where needed (resharding).
- **Inter-operator Parallelism**: Divide the model into stages and execute the stages in a pipelined manner on different sets of devices. This involves choosing an appropriate pipeline schedule and an assignment of devices to stages.

### Experimental Results

- When training large models with billions of parameters, Alpa performs comparably to, or better than, manually tuned specialized systems.
- On GPT models, Alpa matches the specialized system Megatron-LM.
- On GShard MoE models, Alpa achieves a 3.5x speedup on 2 nodes and a 9.7x speedup on 4 nodes over the manually tuned system DeepSpeed.
- Alpa also generalizes to models without manually designed strategies, achieving 80% linear scaling efficiency on Wide-ResNet on 4 nodes.

### Summary

By generating parallel execution plans automatically, Alpa significantly accelerates the training of large-scale deep learning models and lets researchers and developers explore new model designs quickly without having to handle the underlying systems challenges themselves.
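The two parallelism levels described under Technical Details can be illustrated with a small, self-contained sketch. It uses plain Python lists as stand-ins for tensors and list slots as stand-ins for devices. All function names here (`split_columns`, `intra_op_matmul`, `pipeline_schedule`) are hypothetical and are not part of Alpa's API, and the forward-only fill-and-drain schedule is a simplification chosen for illustration, not a reproduction of Alpa's compilation passes or runtime.

```python
# Illustrative sketch only: lists stand in for tensors, list slots for devices.
# None of these names come from Alpa; they are assumptions for this example.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def split_columns(m, parts):
    """Partition a matrix into `parts` equal column blocks (one per device)."""
    width = len(m[0])
    assert width % parts == 0, "columns must divide evenly across devices"
    step = width // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def intra_op_matmul(x, w, num_devices):
    """Intra-operator parallelism: each device multiplies x by its own
    column shard of w. Concatenating the partial outputs stands in for
    the collective that would reassemble the full result."""
    shards = split_columns(w, num_devices)
    partials = [matmul(x, shard) for shard in shards]  # one matmul per device
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_schedule(num_stages, num_microbatches):
    """Inter-operator parallelism: a forward-only fill-and-drain schedule.
    At clock tick t, stage s processes microbatch t - s if that index is
    valid. Returns one list of (stage, microbatch) pairs per tick."""
    ticks = []
    for t in range(num_stages + num_microbatches - 1):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        ticks.append(active)
    return ticks


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
sharded = intra_op_matmul(x, w, num_devices=2)   # computed shard by shard
reference = matmul(x, w)                         # computed on one "device"
schedule = pipeline_schedule(num_stages=2, num_microbatches=3)
```

Here `sharded` equals `reference`, which shows that partitioning the weight matrix along its column axis leaves the operator's result unchanged. In `schedule`, both stages are busy only in the middle ticks; the idle slots at the start and end are the pipeline bubbles that the choice of schedule and device assignment tries to keep small.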

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Coded Parallelism for Distributed Deep Learning

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models

A Survey on Auto-Parallelism of Large-Scale Deep Learning Training

Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs over Heterogeneous Infrastructure

Aware: Adaptive Distributed Training with Computation, Communication and Position Awareness for Deep Learning Model

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

A Linear Algebraic Approach to Model Parallelism in Deep Learning

Proteus: Simulating the Performance of Distributed DNN Training

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming

A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning

Model-Aware Parallelization Strategy for Deep Neural Networks' Distributed Training

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

PipePar: A Pipelined Hybrid Parallel Approach for Accelerating Distributed DNN Training

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Fela: Incorporating Flexible Parallelism and Elastic Tuning to Accelerate Large-Scale DML

Adaptive Distributed Parallel Training Method for a Deep Learning Model Based on Dynamic Critical Paths of DAG