Accelerating Distributed DNN Training via Transport Layer Scheduling

Qingyang Duan, Chao Peng, Zeqin Wang, Yuedong Xu, Shaoteng Liu, Jun Wu, John C. S. Lui
DOI: https://doi.org/10.1109/tpds.2023.3250462
IF: 5.3
2023-04-05
IEEE Transactions on Parallel and Distributed Systems
Abstract: Communication scheduling is crucial to accelerating the training of large deep learning models: it determines the transmission order of layer-wise deep neural network (DNN) tensors to improve the overlap between computation and communication. Prior approaches adopt user-level tensor partitioning to give priority scheduling a finer granularity. However, a startup time slot inserted before every tensor partition neutralizes this scheduling gain. Tuning hyper-parameters for tensor partitioning is also difficult, especially when the network bandwidth is shared or time-varying in multi-tenant clusters. In this article, we propose Mercury, a simple transport layer scheduler that moves priority scheduling to the transport layer and operates at packet granularity. The packets with the highest priority in the Mercury buffer are transmitted first. Mercury achieves near-optimal overlap between communication and computation. It also leverages immediate aggregation at the transport layer to fully overlap gradient push and pull. We implement Mercury in MXNet and conduct comprehensive experiments on five popular DNN models in various environments. Mercury adapts well to dynamic communication and computation resources. Experiments show that Mercury accelerates training by up to 130% compared to the classical PS architecture, and by up to 104% compared to state-of-the-art tensor partitioning methods.
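The packet-level priority rule described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not Mercury's implementation: the names `Packet`, `PriorityBuffer`, `packetize`, and `simulate` are hypothetical, the 4-byte packet size is arbitrary, and the priority rule (a lower layer index is sent first, since front layers are needed first in the next forward pass) is an assumption borrowed from common priority-scheduling practice rather than stated in the abstract.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Packet:
    priority: int                        # smaller value is sent earlier (assumed rule)
    seq: int                             # arrival order, keeps ties first-in first-out
    layer: int = field(compare=False)
    payload: bytes = field(compare=False)


class PriorityBuffer:
    """Hypothetical send buffer that always releases the highest-priority packet."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def push(self, layer, payload):
        # Assumption: priority equals the layer index (layer 0 is most urgent).
        heapq.heappush(self._heap, Packet(layer, next(self._seq), layer, payload))

    def pop(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


def packetize(tensor_bytes, size=4):
    """Split a gradient tensor into fixed-size packets (size is illustrative)."""
    return [tensor_bytes[i:i + size] for i in range(0, len(tensor_bytes), size)]


def simulate():
    """Return the layer index of each transmitted packet, in send order."""
    buf = PriorityBuffer()
    sent = []
    # Backpropagation produces gradients from the last layer to the first.
    for chunk in packetize(b"22222222"):     # layer 2: two packets
        buf.push(2, chunk)
    sent.append(buf.pop().layer)             # first layer-2 packet goes out
    for chunk in packetize(b"1111"):         # layer 1 arrives mid-transfer
        buf.push(1, chunk)
    sent.append(buf.pop().layer)             # layer 1 overtakes the rest of layer 2
    for chunk in packetize(b"0000"):         # layer 0 arrives last
        buf.push(0, chunk)
    while buf:
        sent.append(buf.pop().layer)         # drain: layer 0, then remaining layer 2
    return sent


if __name__ == "__main__":
    print(simulate())
```

Because the unit of scheduling is a single packet, a newly arrived high-priority gradient waits for at most one packet of a lower-priority tensor, and no per-partition startup cost is introduced at the user level.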
Categories: Computer Science, Theory & Methods; Engineering, Electrical & Electronic