HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training.
Yabo Duan, Zhiquan Lai, Shengwei Li, Weijie Liu, Keshi Ge, Peng Liang, Dongsheng Li
DOI: https://doi.org/10.1109/cluster51413.2022.00043
2022-01-01
Abstract: As deep learning models grow larger, training a model on a single computational resource becomes impractical. To solve this, hybrid parallelism, which combines data and pipeline parallelism, has emerged to train large models on multiple GPUs. In practice, training large models on heterogeneous GPU clusters is a common need because hardware is often upgraded only in part. However, existing hybrid parallelism approaches do not work well in heterogeneous environments with respect to communication efficiency, workload balance among GPUs, and utilization of memory-constrained GPUs. To address these problems, we present a parallel DNN training approach, Hybrid Parallelism on Heterogeneous clusters (HPH). In HPH, we propose a topology designer that minimizes communication time. Furthermore, HPH uses a partition algorithm that automatically partitions DNN layers among workers to maximize throughput. In addition, HPH adopts recomputation-aware scheduling to reduce memory consumption and further reschedules the pipeline to eliminate the extra time overhead of recomputation. Our experimental results on a 32-GPU heterogeneous cluster show that HPH achieves a training speed-up of up to 1.42x compared with the state-of-the-art approach.
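The abstract does not give the details of HPH's partition algorithm, so the following is only an illustrative sketch of the general idea of balancing pipeline stages across GPUs of different speeds. It assumes each layer has a known compute cost and each GPU a known relative speed, and it splits the layers into contiguous stages so that the slowest stage is as fast as possible. The function name `partition_layers`, the cost model, and the inputs are all hypothetical; communication cost, memory limits, and data-parallel replication, which HPH does consider, are ignored here.

```python
from typing import List, Tuple


def partition_layers(
    layer_costs: List[float], gpu_speeds: List[float]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Assign contiguous layer ranges to GPUs (in the given order).

    Hypothetical cost model: stage time = sum of layer costs / GPU speed.
    Returns the bottleneck stage time and the [start, end) range per GPU.
    """
    n, k = len(layer_costs), len(gpu_speeds)
    if k == 0 or k > n:
        raise ValueError("need at least one GPU and at least one layer per GPU")

    # prefix[i] = total cost of the first i layers
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[g][i]: smallest bottleneck when the first i layers use the first g GPUs
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for g in range(1, k + 1):
        for i in range(g, n + 1):
            # GPU g-1 takes layers j..i-1; every stage gets at least one layer
            for j in range(g - 1, i):
                if best[g - 1][j] == inf:
                    continue
                stage_time = (prefix[i] - prefix[j]) / gpu_speeds[g - 1]
                candidate = max(best[g - 1][j], stage_time)
                if candidate < best[g][i]:
                    best[g][i] = candidate
                    cut[g][i] = j

    # Walk the recorded cut points backwards to recover the stage boundaries
    stages = []
    end = n
    for g in range(k, 0, -1):
        start = cut[g][end]
        stages.append((start, end))
        end = start
    stages.reverse()
    return best[k][n], stages


if __name__ == "__main__":
    # Six equal-cost layers, one GPU twice as fast as the other (made-up numbers)
    bottleneck, stages = partition_layers([4.0] * 6, [2.0, 1.0])
    print(bottleneck, stages)
```

In this toy setting the faster GPU receives four of the six layers and the slower one receives two, so both stages take the same time. Since pipeline throughput is bounded by the slowest stage, minimizing that bottleneck is a common proxy for maximizing throughput.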