A Layer-Based Sparsification Method for Distributed DNN Training.
Yanqing Hu, Qing Ye, Zhongyu Zhang, Jiancheng Lv
DOI: https://doi.org/10.1109/hpcc-dss-smartcity-dependsys57074.2022.00209
2022-01-01
Abstract: As Deep Neural Networks (DNNs) and datasets grow in size, DNN training becomes increasingly time-consuming. Various distributed strategies have been used to speed up training. Nevertheless, frequent communication between computational nodes severely limits how far a cluster can scale. Many gradient compression methods (e.g., sparsification, quantization, low-rank approximation) have been proposed to optimize the transmission process. However, because each layer must be processed and compressed separately before transmission, these methods require a large number of time-consuming tensor operations. To alleviate this issue, we propose a layer-based sparsification method, motivated by the observation that different layers of a DNN differ in learning efficiency and convergence speed during training. Instead of compressing the gradients of all layers in every iteration, the method preferentially selects some layers using a sliding window, which avoids most unnecessary tensor operations. The sliding window is adjusted according to the dynamic characteristics of the parameters during training. We evaluate the proposed method on one text task and two image classification tasks, comparing layer-based sparsification with three state-of-the-art gradient compression methods under the same environment configuration. Total training time, communication time, and model accuracy are collected and analyzed. The experimental results show that both training time and communication time are reduced substantially, by up to 60% in some cases. Additionally, although model quality decreases slightly (within 1%), throughput improves by as much as 50%.
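To make the idea of window-based layer selection concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify how the window is sized or moved, nor what happens to layers outside it, so the sketch makes three assumptions: the window has a fixed size and advances round-robin over the layers, selected layers are compressed with top-k sparsification, and gradients that are not sent are accumulated locally as a residual. All names (`top_k_sparsify`, `window_layers`, `compress_step`) are hypothetical.

```python
from typing import Dict, List, Tuple

SparseGrad = Dict[int, float]  # maps element index -> gradient value


def top_k_sparsify(grad: List[float], k: int) -> SparseGrad:
    """Keep the k largest-magnitude entries of a flat gradient."""
    k = max(0, min(k, len(grad)))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


def window_layers(num_layers: int, start: int, size: int) -> List[int]:
    """Layer indices covered by a window that wraps around the layer list."""
    size = min(size, num_layers)
    return [(start + offset) % num_layers for offset in range(size)]


def compress_step(
    grads: List[List[float]],
    residuals: List[List[float]],
    start: int,
    size: int,
    ratio: float,
) -> Tuple[Dict[int, SparseGrad], int]:
    """Sparsify only the layers inside the window; update residuals in place.

    Assumption: layers outside the window send nothing this iteration and
    keep their full gradient as a local residual for a later iteration.
    """
    selected = set(window_layers(len(grads), start, size))
    payload: Dict[int, SparseGrad] = {}
    for layer, grad in enumerate(grads):
        # Fold in whatever this layer failed to send in earlier iterations.
        acc = [g + r for g, r in zip(grad, residuals[layer])]
        if layer in selected:
            k = max(1, int(len(acc) * ratio))
            sparse = top_k_sparsify(acc, k)
            payload[layer] = sparse
            # Entries that were sent are cleared; the rest stay as residual.
            residuals[layer] = [0.0 if i in sparse else v for i, v in enumerate(acc)]
        else:
            # No sorting or selection is performed for skipped layers.
            residuals[layer] = acc
    # Assumption: fixed-size window advanced round-robin.
    next_start = (start + size) % len(grads)
    return payload, next_start


grads = [[0.1, -0.5, 0.2, 0.05], [1.0, -2.0], [0.3, 0.3, -0.9, 0.0]]
residuals = [[0.0] * len(g) for g in grads]
payload, next_start = compress_step(grads, residuals, start=0, size=2, ratio=0.5)
```

In this sketch the savings come from the `else` branch: layers outside the window only perform an element-wise add, with no top-k selection. The method described in the abstract adjusts the window dynamically during training, which the fixed round-robin rule here does not attempt to reproduce.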