A Novel Stochastic Gradient Descent Algorithm Based on Grouping over Heterogeneous Cluster Systems for Distributed Deep Learning
Wenbin Jiang, Geyan Ye, Laurence T. Yang, Jian Zhu, Yang Ma, Xia Xie, Hai Jin
DOI: https://doi.org/10.1109/ccgrid.2019.00053
2019-01-01
Abstract: On heterogeneous cluster systems, differences in machine performance substantially degrade the convergence of neural network models. In this paper, we propose a novel distributed Stochastic Gradient Descent (SGD) algorithm, Grouping-SGD, for distributed deep learning, which converges faster than Sync-SGD, Async-SGD, and Stale-SGD. Grouping-SGD partitions machines into multiple groups so that machines in the same group have similar performance. Machines within a group update the model synchronously, while different groups update it asynchronously. To improve performance further, the parameter servers are ordered from fastest to slowest and are made responsible for updating the model parameters from the lower layers to the higher layers, respectively. Experimental results on popular image classification benchmarks (MNIST, Cifar10, Cifar100, and ImageNet) indicate that Grouping-SGD achieves speedups of 1.2 to 3.7 times over Sync-SGD, Async-SGD, and Stale-SGD.
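The three ideas in the abstract (grouping machines of similar performance, synchronous updates within a group, and assigning lower layers to faster parameter servers) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state how groups are formed or how layers are divided, so the throughput metric, the `tolerance` threshold, the greedy grouping rule, the even contiguous layer split, and all function names below are assumptions made for illustration only.

```python
from typing import Dict, List


def group_by_speed(throughput: Dict[str, float],
                   tolerance: float = 0.2) -> List[List[str]]:
    """Greedily group machines with similar measured throughput.

    Assumption: a machine joins the current group if it is at most
    `tolerance` (as a fraction) slower than that group's fastest member.
    """
    ordered = sorted(throughput, key=throughput.get, reverse=True)
    groups: List[List[str]] = []
    for name in ordered:
        if groups and throughput[name] >= (1 - tolerance) * throughput[groups[-1][0]]:
            groups[-1].append(name)
        else:
            groups.append([name])  # too slow for the current group: start a new one
    return groups


def apply_group_update(params: List[float],
                       member_grads: List[List[float]],
                       lr: float) -> List[float]:
    """One synchronous step for a single group.

    Gradients from all members of the group are averaged, then applied once.
    Different groups would call this independently (asynchronously) on the
    shared parameters.
    """
    avg = [sum(g) / len(member_grads) for g in zip(*member_grads)]
    return [p - lr * a for p, a in zip(params, avg)]


def assign_layers(servers_fast_to_slow: List[str],
                  num_layers: int) -> Dict[str, List[int]]:
    """Give contiguous layer blocks to servers, lowest layers to the fastest.

    Assumption: layers are split as evenly as possible; the abstract only
    says faster servers handle lower layers.
    """
    base, extra = divmod(num_layers, len(servers_fast_to_slow))
    assignment: Dict[str, List[int]] = {}
    start = 0
    for i, server in enumerate(servers_fast_to_slow):
        size = base + (1 if i < extra else 0)
        assignment[server] = list(range(start, start + size))
        start += size
    return assignment


if __name__ == "__main__":
    # Hypothetical throughput measurements (e.g. samples per second).
    speeds = {"a": 100.0, "b": 95.0, "c": 50.0, "d": 48.0, "e": 20.0}
    print(group_by_speed(speeds))
    print(apply_group_update([1.0, 2.0], [[1.0, 0.0], [3.0, 2.0]], lr=0.5))
    print(assign_layers(["ps0", "ps1"], num_layers=5))
```

With the hypothetical measurements above, machines `a` and `b` land in one group, `c` and `d` in a second, and `e` alone in a third, so a slow machine delays only the members of its own group.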