HCEC: An efficient geo-distributed deep learning training strategy based on wait-free back-propagation

Yingjie Song, Yongbao Ai, Xiong Xiao, Zhizhong Liu, Zhuo Tang, Kenli Li
DOI: https://doi.org/10.1016/j.sysarc.2024.103070
IF: 5.836
2024-01-28
Journal of Systems Architecture
Abstract: Valuable data is often distributed across multiple data centers (DCs). Deep learning (DL) tasks, constrained by privacy regulations, utilize local training and model averaging to facilitate collaborative training across multiple DCs. However, the hierarchical bandwidth within and between DCs diminishes the training efficiency for decentralized data. Therefore, it is imperative to prioritize research efforts aimed at reducing communication overhead while preserving convergence performance for geographically distributed DL tasks. To address this challenge, we propose a High-Convergence and Efficient-Communication (HCEC) training strategy for geographically distributed data. In this paper, we adopt two approaches: (1) to ensure high convergence, we utilize dynamic learning rates and local epochs to avoid local optima; (2) to ensure efficient communication, we introduce the Adaptive Layerwise Communication (ALC) method to minimize inter-DC communication costs. The ALC method decides whether to communicate all L-layer model parameters at once or perform L-times communication based on the available bandwidth and computational training overhead. Experimental results show that compared to the model averaging method, HCEC ensures convergence and improves training efficiency by at most 37.9%.
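The ALC decision described in the abstract (one transfer of all L layers versus L per-layer transfers) can be illustrated with a rough cost model. The abstract does not give the paper's actual criterion, so everything below is an assumption for illustration: the function names, the fixed per-message latency term, and the rule that a layer's transfer may start as soon as its backward step finishes and the link is free (the usual wait-free back-propagation overlap).

```python
from typing import List


def one_shot_time(backward_s: List[float], sizes_bytes: List[float],
                  bandwidth_bps: float, latency_s: float) -> float:
    """Finish the whole backward pass, then send all layers in one message."""
    return sum(backward_s) + latency_s + sum(sizes_bytes) / bandwidth_bps


def layerwise_time(backward_s: List[float], sizes_bytes: List[float],
                   bandwidth_bps: float, latency_s: float) -> float:
    """Send each layer once its backward step is done and the link is free.

    Lists are ordered from layer 1 (input side) to layer L (output side);
    the backward pass visits them in reverse.
    """
    compute_done = 0.0  # time at which the current layer's backward step ends
    link_free = 0.0     # time at which the inter-DC link becomes idle again
    for step, size in zip(reversed(backward_s), reversed(sizes_bytes)):
        compute_done += step
        start = max(compute_done, link_free)
        link_free = start + latency_s + size / bandwidth_bps
    return link_free


def choose_mode(backward_s: List[float], sizes_bytes: List[float],
                bandwidth_bps: float, latency_s: float) -> str:
    """Pick whichever schedule has the lower estimated completion time."""
    once = one_shot_time(backward_s, sizes_bytes, bandwidth_bps, latency_s)
    per_layer = layerwise_time(backward_s, sizes_bytes, bandwidth_bps, latency_s)
    return "layerwise" if per_layer < once else "one_shot"


if __name__ == "__main__":
    # Hypothetical two-layer model: 1 s backward per layer, 100 bytes per layer.
    print(choose_mode([1.0, 1.0], [100.0, 100.0], 100.0, 0.0))
    print(choose_mode([1.0, 1.0], [100.0, 100.0], 100.0, 5.0))
```

Under this assumed model, layerwise transfer wins when per-message latency is small enough for communication to hide behind computation, and the single transfer wins when latency dominates, which matches the abstract's statement that the choice depends on available bandwidth and computational overhead.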
computer science, software engineering, hardware & architecture