3D Parallelism for Transformers Via Integer Programming

Hao Zheng,Peng Liang,Yu Tang,Yanqi Shi,Linbo Qiao,Dongsheng Li
DOI: https://doi.org/10.1109/icassp48485.2024.10446916
2024-01-01
Abstract:Transformer models, such as BERT, GPT, and ViT, have been applied to a wide range of areas in recent years, due to their efficacy. In order to improve the training efficiency of Transformer models, different distributed training approaches have been proposed, like Megatron-LM [8]. However, when multi-dimensional parallelism strategies are considered, due to the complexity, existing works can not harmonize the different strategies well enough to obtain a globally optimal solution. In this paper, we propose a parallelism strategy searching algorithm PTIP, which generates operator-level parallelism strategies consisting of three schemes: data parallelism, tensor parallelism, and pipeline parallelism. PTIP abstracts these three parallelism schemes simultaneously into an auxiliary graph, reformulates the searching problem into a mixed-integer programming (MIP) problem, and uses a MIP solver to obtain a high-quality multi-dimensional strategy. Experiments conducted on Transformers demonstrate that PTIP obtains 13.9% − 24.7% performance improvement compared to Megatron-LM [8].
What problem does this paper attempt to address?