Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica
DOI: https://doi.org/10.48550/arXiv.2201.12023
2022-06-29
Abstract: Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa
Machine Learning; Distributed, Parallel, and Cluster Computing; Programming Languages
What problem does this paper attempt to address?
This paper addresses how to automatically generate efficient model-parallel training plans for large models in distributed deep learning. Existing model-parallel training systems either require users to create parallelization plans by hand or can only generate a plan automatically from a limited space of model-parallel configurations. Neither approach suffices to scale the training of complex deep learning models across distributed compute devices. Alpa addresses this by treating parallelism at two hierarchical levels, intra-operator and inter-operator, and automatically generating execution plans that span both.

### Main Contributions

1. **Constructed a two-level parallel execution plan space**: A hierarchical space built from intra-operator and inter-operator parallelism, in which the plans at each level can be optimized independently.
2. **Designed effective optimization algorithms**: Tractable optimization algorithms that generate near-optimal execution plans at each level.
3. **Implemented the Alpa compiler system**: Alpa is a compiler system for distributed deep learning on GPU clusters. It includes:
   - A series of compilation passes that use the hierarchical optimization algorithms to generate execution plans.
   - A new runtime architecture that coordinates inter-operator parallelism between different device meshes.
   - Multiple system optimizations that improve the compilation process and cross-mesh communication.

### Technical Details

- **Intra-operator Parallelism**: Partition an operator along one or more tensor axes and assign the partitions to distributed devices. This involves choosing a parallel algorithm for the operator and, when needed, converting the layout of its input tensors (resharding).
- **Inter-operator Parallelism**: Divide the model into stages and execute the stages in a pipelined manner on different sets of devices. This involves choosing a pipeline schedule and a device assignment.

### Experimental Results

- When training large models with billions of parameters, Alpa performs on par with or better than hand-tuned specialized systems.
- On GPT models, Alpa matches the specialized system Megatron-LM.
- On GShard MoE models, compared with the hand-tuned system DeepSpeed, Alpa achieves a 3.5× speedup on 2 nodes and a 9.7× speedup on 4 nodes.
- Alpa also generalizes to models without manually designed strategies, achieving 80% linear scaling efficiency on Wide-ResNet on 4 nodes.

### Summary

Alpa accelerates the training of large-scale deep learning models through automated parallel plan generation, letting researchers and developers explore new model designs quickly without having to handle the underlying system complexity themselves.
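
The two levels described under Technical Details can be illustrated with a toy, single-process sketch. This is not Alpa's implementation or API: the function names (`matmul`, `intra_op_matmul`, `split_into_stages`) are hypothetical, devices are simulated by a loop, and the stage splitter uses a simplified objective (minimize the cost of the slowest stage) that was chosen here for brevity and may differ from the objective Alpa's compilation passes optimize.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply, standing in for a single operator."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def intra_op_matmul(a: Matrix, b: Matrix, num_devices: int) -> Matrix:
    """Intra-operator parallelism (simulated): shard `a` along its row axis.

    Each simulated device multiplies its row shard by a full replica of `b`.
    Concatenating the partial outputs along the row axis rebuilds the result.
    """
    rows_per_device = -(-len(a) // num_devices)  # ceiling division
    out: Matrix = []
    for d in range(num_devices):
        shard = a[d * rows_per_device:(d + 1) * rows_per_device]
        out.extend(matmul(shard, b))  # on real hardware these run concurrently
    return out


def split_into_stages(costs: List[float], num_stages: int) -> List[Tuple[int, int]]:
    """Inter-operator parallelism (simplified): cut layers into pipeline stages.

    Stages are contiguous layer ranges. Dynamic programming picks the cuts
    that minimize the cost of the slowest stage (an assumed objective).
    Returns half-open (start, end) layer index ranges, one per stage.
    """
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(costs) stages")
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    inf = float("inf")
    # best[k][i]: smallest bottleneck when the first i layers form k stages
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last stage covers layers j..i-1
                cand = max(best[k - 1][j], prefix[i] - prefix[j])
                if cand < best[k][i]:
                    best[k][i], cut[k][i] = cand, j
    # Walk the recorded cuts backwards to recover the stage boundaries.
    stages, end = [], n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        stages.append((start, end))
        end = start
    return stages[::-1]
```

For example, `split_into_stages([1, 2, 3, 4, 5], 2)` returns `[(0, 3), (3, 5)]`, giving stage costs of 6 and 9, and `intra_op_matmul(a, b, 2)` produces the same matrix as `matmul(a, b)`. In a two-level plan, each stage returned by the splitter would be placed on its own group of devices, and the operators inside that stage would then be sharded across the group.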