Hybrid Parameter Update: Alleviating Imbalance Impacts for Distributed Deep Learning.

Hongliang Li, Dong Xu, Zhewen Xu, Xiang Li
DOI: https://doi.org/10.1109/hpcc-dss-smartcity-dependsys57074.2022.00063
2022-01-01
Abstract: Data parallelism is widely used to train deep learning models on large datasets across distributed clusters, but it suffers from costly global parameter updates at batch barriers. Performance imbalance among worker instances, introduced by uneven workload partitioning or biased resource allocation, can produce straggler workers, which severely degrade both training speed and result accuracy. This paper studies the issue, focusing on the tradeoff between training speed and result accuracy. We propose Cooperate Grouping Parallel (CGP), a hybrid parameter update solution that offers the flexibility of both synchronous and asynchronous update schemes. We introduce a novel Cooperate Worker Grouping Problem (CWGP), which seeks a task grouping configuration that maximizes model accuracy while meeting customized training speed guarantees. We propose an evolution-based Pareto local search algorithm to compute efficient worker grouping configurations. Comprehensive evaluation results demonstrate the effectiveness of CGP under both persistent and fluctuating imbalances. The proposed method alleviates the imbalance impacts without introducing extra adjustment overheads.
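To make the straggler cost at a batch barrier concrete, the following is a minimal Python sketch, not the paper's method. It assumes that workers inside a group synchronize at their own barrier while groups proceed independently, and it uses a naive sort-and-split grouping in place of the evolution-based Pareto local search named in the abstract. The function names `idle_time` and `group_by_speed` and the per-iteration times are hypothetical, chosen only for illustration.

```python
from typing import List


def idle_time(times: List[float]) -> float:
    """Total time workers spend waiting at one synchronous barrier.

    Every worker waits for the slowest one, so each contributes
    (slowest - own time) of idle time.
    """
    slowest = max(times)
    return sum(slowest - t for t in times)


def group_by_speed(times: List[float], num_groups: int) -> List[List[float]]:
    """Naive grouping (an assumption, not the paper's algorithm):
    sort workers by iteration time and split into contiguous groups."""
    ordered = sorted(times)
    size = -(-len(ordered) // num_groups)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical per-iteration compute times (ms); the last worker is a straggler.
times = [10.0, 11.0, 12.0, 30.0]

# One global barrier: all workers wait for the 30 ms straggler.
full_sync_idle = idle_time(times)

# Two groups, each with its own barrier: fast workers no longer wait
# for a straggler that sits in another group.
groups = group_by_speed(times, 2)
grouped_idle = sum(idle_time(g) for g in groups)

print(f"fully synchronous idle time: {full_sync_idle} ms")
print(f"grouped idle time:           {grouped_idle} ms")
```

The sketch only models waiting time. It says nothing about the accuracy side of the tradeoff, which is the part the CWGP formulation is meant to capture.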