Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control

Zhigang Wang, Xu Zhang, Ning Wang, Chuanfei Xu, Jie Nie, Zhiqiang Wei, Yu Gu, Ge Yu
2024-01-21
Abstract: Transformer-based models have recently grown deeper and larger. For better scalability, a common training solution in industry is to split billions of parameters (tensors) into many tasks and run them across homogeneous accelerators (e.g., GPUs). However, such a dedicated compute cluster is prohibitively expensive for academia and mid-sized companies. An economical alternative is to aggregate existing heterogeneous devices and share resources among multiple tenants. Nevertheless, static hardware configurations and dynamic resource contention inevitably cause straggling tasks, which severely degrade overall training efficiency. Existing work is mainly tailored to traditional data parallelism and does not work well for the newer tensor parallelism because of its strict communication and correctness constraints.
Distributed, Parallel, and Cluster Computing