Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

Jinghui Zhang, Jun Zhan, Jiange Li, Jiahui Jin, Lei Qian
DOI: https://doi.org/10.1002/cpe.5923
2020-07-23
Concurrency and Computation: Practice and Experience
Abstract: Training a deep neural network (DNN) demands substantial computing and memory resources. Researchers often use distributed parallel training on GPUs to train larger models faster. This approach has two drawbacks. First, as GPU compute capacity grows, inter‐GPU communication becomes a larger bottleneck during training, and multi‐GPU systems introduce complex connectivity. Workload schedulers must therefore consider both hardware topology and workload communication requirements when allocating GPU resources, in order to optimize execution time and improve utilization in a heterogeneous environment. Second, the high memory requirements of DNN training make it difficult to run the training process on GPUs. To address these problems, we introduce two execution optimization methods based on pipeline‐hybrid parallelism (which combines data and model parallelism) in a GPU cluster with heterogeneous networking. First, we propose a model partition algorithm that accelerates pipeline‐hybrid parallel training across heterogeneously network‐connected GPUs. Second, we introduce a cost‐balanced recomputing algorithm that reduces memory usage in pipeline mode. Experiments show that our solution (Pipe‐Torch) achieves an average speedup of 1.4× over data parallelism and reduces the memory footprint while maintaining load‐balanced pipelined training.
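The abstract does not spell out the model partition algorithm, so the following is only a minimal sketch of the general problem it targets, not Pipe‐Torch's actual method. It assumes a model that is a simple chain of layers, one pipeline stage per GPU, and a stage cost equal to its compute time plus the time to send its output activation over the link to the next stage, with no overlap of compute and communication. The function name `partition_layers` and all parameters and numbers are hypothetical. Since steady‐state pipeline throughput is bounded by the slowest stage, the sketch picks the contiguous split that minimizes the largest stage cost.

```python
from functools import lru_cache


def partition_layers(compute_ms, activation_mb, link_mb_per_ms):
    """Split a chain of layers into contiguous pipeline stages.

    Hypothetical inputs (illustrative only, not from the paper):
      compute_ms[i]      compute time of layer i
      activation_mb[i]   size of the output activation of layer i
      link_mb_per_ms[s]  bandwidth of the link from stage s to stage s + 1

    Returns (bottleneck_ms, cuts), where cuts[s] is the exclusive end
    index of stage s. Requires at least one layer per stage.
    """
    n = len(compute_ms)
    k = len(link_mb_per_ms) + 1  # number of stages

    # Prefix sums so that any stage's compute time is an O(1) lookup.
    prefix = [0.0]
    for c in compute_ms:
        prefix.append(prefix[-1] + c)

    def stage_cost(start, end, s):
        cost = prefix[end] - prefix[start]
        if s < k - 1:
            # Assumption: sending the boundary activation is not
            # overlapped with compute, so it adds to the stage cost.
            cost += activation_mb[end - 1] / link_mb_per_ms[s]
        return cost

    @lru_cache(maxsize=None)
    def best(start, s):
        # Smallest achievable bottleneck for layers[start:] on stages s..k-1.
        if s == k - 1:
            return stage_cost(start, n, s), (n,)
        result = (float("inf"), ())
        # Leave at least one layer for each of the remaining stages.
        for end in range(start + 1, n - (k - 1 - s) + 1):
            tail_cost, tail_cuts = best(end, s + 1)
            bottleneck = max(stage_cost(start, end, s), tail_cost)
            if bottleneck < result[0]:
                result = (bottleneck, (end,) + tail_cuts)
        return result

    return best(0, 0)


# Six equal layers on three GPUs; the second link is 4x slower than the first.
bottleneck, cuts = partition_layers([2] * 6, [4] * 6, [4.0, 1.0])
```

In this toy setup the middle stage receives a single layer, because every transfer over the slow second link is charged to that stage. This shows why a partition that ignores link heterogeneity and splits layers evenly can be unbalanced in practice.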