Distributed Training Optimization for DCU

Jingming Xie, Lin Han, Jianan Li, Biaoyuan Chen, Rongcai Zhao, Pengfei Li, Wei Gao
DOI: https://doi.org/10.1109/ISPDS62779.2024.10667546
2024-05-31
Abstract: As models grow larger and data volumes expand, a single accelerator is no longer sufficient to meet training demands. Simply stacking multiple accelerators in parallel does not by itself improve training efficiency and instead wastes computing resources. To address this for existing distributed training parallel methods, we first formally define parallel strategies and construct cost bounds that enable the automatic selection of an efficient parallel strategy. We also implement a time-balanced partition algorithm for pipeline parallelism that distributes the model more evenly across stages and reduces overall training time. Experiments on a single machine with four DCU cards and on two machines with eight DCU cards in total show that multiple models achieve strong distributed training performance under the automatically selected parallel strategy. With the pipeline optimization, the average speedup reaches 1.13×. In summary, automatic strategy selection and pipeline parallelism optimization significantly improve the efficiency of distributed training for large models.
Engineering, Computer Science
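
The abstract does not spell out the time-balanced partition algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each layer has a measured per-layer time (the `layer_times` list is hypothetical profiling data), that stages must be contiguous runs of layers, and that the goal is to minimize the time of the slowest pipeline stage. The function names `balanced_partition` and `stage_times` are illustrative, and communication cost between stages is ignored.

```python
from typing import List


def balanced_partition(layer_times: List[float], num_stages: int) -> List[List[int]]:
    """Split layers, in order, into contiguous stages so that the slowest
    stage is as fast as possible (classic linear-partition dynamic program)."""
    n = len(layer_times)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of layers")

    # prefix[i] holds the total time of layers 0..i-1.
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    inf = float("inf")
    # best[k][i]: smallest achievable bottleneck when the first i layers form k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    # cut[k][i]: index where the last of those k stages begins.
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0

    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # The last stage covers layers j..i-1.
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j

    # Walk the cut table backwards to recover the stage boundaries.
    stages: List[List[int]] = []
    i = n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        stages.append(list(range(j, i)))
        i = j
    stages.reverse()
    return stages


def stage_times(layer_times: List[float], stages: List[List[int]]) -> List[float]:
    """Total time of each stage under a given partition."""
    return [sum(layer_times[i] for i in stage) for stage in stages]


if __name__ == "__main__":
    # Hypothetical per-layer times (arbitrary units) for a six-layer model on four devices.
    times = [4, 2, 3, 1, 5, 3]
    parts = balanced_partition(times, 4)
    print(parts, stage_times(times, parts))
```

A partition that splits by layer count alone would ignore that layers differ in cost; balancing by measured time instead targets the bottleneck stage, which bounds pipeline throughput. A practical implementation would likely also account for inter-stage communication and memory limits, which this sketch leaves out.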