Communication Patterns in Distributed Deep Learning

M. Ghobadi, A. Farhat
Abstract: Machine learning is increasingly deployed in the cloud to exploit massive scaling and thereby reduce the time-to-accuracy (TTA) of training. To this end, various distributed training frameworks are in use, with Horovod from Uber emerging as a popular choice. To get the most performance out of such a framework, it is important to maximally overlap computation and communication while maintaining high GPU utilization, which shortens each training iteration. As a first step in this direction, this project studies the communication component of training. We train Deep Neural Network (DNN) models of various sizes on sixteen GPUs on the Google Cloud Compute Engine platform and record the data the workers exchange as well as the timing of each training iteration. Our two main observations are: (i) the amount of data exchanged between workers at each training iteration is proportional to the model size; and (ii) the duration of training is not fully determined by the model size; it also depends on the compute hardware, communication bandwidth, and batch size. On their own, these findings do not offer a complete enough picture for improving the TTA of models, but they can do so in combination with information about computation.
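Observation (i) can be illustrated with a back-of-the-envelope estimate. The sketch below is not the authors' measurement method. It assumes gradients are synchronized with a bandwidth-optimal ring all-reduce (the scheme Horovod is commonly associated with) and stored as 32-bit floats; the function name and parameters are hypothetical. Under those assumptions, the bytes each worker sends per iteration grow linearly with the number of model parameters.

```python
def allreduce_bytes_per_worker(num_params: int,
                               num_workers: int,
                               bytes_per_param: int = 4) -> float:
    """Estimate bytes one worker sends in a single ring all-reduce.

    Assumptions (not taken from the paper): the full gradient is
    all-reduced once per iteration, and each parameter's gradient
    occupies `bytes_per_param` bytes (4 for float32).
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    gradient_bytes = num_params * bytes_per_param
    # A ring all-reduce has a reduce-scatter phase and an all-gather
    # phase; in each, a worker sends (N - 1) / N of the gradient.
    return 2 * (num_workers - 1) / num_workers * gradient_bytes


if __name__ == "__main__":
    # Illustrative model sizes only; sixteen workers as in the abstract.
    for params in (1_000_000, 10_000_000, 100_000_000):
        sent = allreduce_bytes_per_worker(params, num_workers=16)
        print(f"{params:>11,d} params: {sent / 1e6:.1f} MB sent per worker")
```

The estimate covers communication volume only. It says nothing about iteration time, which, per observation (ii), also depends on compute hardware, bandwidth, and batch size.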
Computer Science