Fast Distributed Deep Learning Via Worker-adaptive Batch Sizing

Chen, Qizhen Weng, Wei Wang, Baochun Li, Bo Li
DOI: https://doi.org/10.1145/3267809.3275463
2018-01-01
Abstract: In heterogeneous or shared clusters, straggling workers slow down distributed learning. In this work, we propose LB-BSP, a new synchronization scheme that eliminates stragglers by adapting each worker's training load (batch size) to its processing capability. In shared production clusters, deciding the workers' batch sizes requires knowing their processing speeds before each iteration starts. To this end, we adopt NARX, an extended recurrent neural network whose predictions account for both the historical speeds and driving factors such as CPU and memory.
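As a rough illustration of adapting batch size to processing capability, the sketch below splits a fixed global batch across workers in proportion to their predicted speeds, so that every worker is expected to finish an iteration at about the same time. The proportional rule, the fixed global batch, and the names `allocate_batch_sizes`, `predicted_speeds`, and `global_batch` are assumptions made for this example; the abstract does not specify the allocation rule, and the speed predictor (NARX in the paper) is treated here as a given input.

```python
from typing import List


def allocate_batch_sizes(predicted_speeds: List[float], global_batch: int) -> List[int]:
    """Split global_batch across workers in proportion to predicted speed.

    predicted_speeds: samples per second each worker is expected to process
    in the next iteration (hypothetical input; any predictor could supply it).
    Returns one integer batch size per worker, summing to global_batch.
    """
    if not predicted_speeds or any(s <= 0 for s in predicted_speeds):
        raise ValueError("every worker needs a positive predicted speed")

    total_speed = sum(predicted_speeds)
    # Ideal (fractional) share of the global batch for each worker.
    ideal = [global_batch * s / total_speed for s in predicted_speeds]
    # Round down first, then hand out the leftover samples one at a time
    # to the workers with the largest fractional remainders.
    sizes = [int(x) for x in ideal]
    leftover = global_batch - sum(sizes)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - sizes[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# A worker twice as fast receives twice the load.
print(allocate_batch_sizes([100.0, 200.0, 100.0], 256))  # [64, 128, 64]
```

With these sizes, the expected per-iteration time `size / speed` is 0.64 seconds for all three workers, whereas an even split would leave the faster worker idle while it waits for the two slower ones.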