ResIST: Layer-Wise Decomposition of ResNets for Distributed Training

Chen Dun, Cameron R. Wolfe, Christopher M. Jermaine, Anastasios Kyrillidis
DOI: https://doi.org/10.48550/arXiv.2107.00961
2022-03-14
Abstract: We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.
Machine Learning; Computer Vision and Pattern Recognition; Distributed, Parallel, and Cluster Computing; Optimization and Control
What problem does this paper attempt to address?
The paper addresses the following problem: **how to make distributed training of ResNets (Residual Networks) more efficient by reducing communication, memory, and time costs**. Existing distributed training methods, such as data-parallel training and local SGD, accelerate training to some extent but still have the following problems:

1. **High communication cost**: Frequent parameter synchronization requires transferring a large volume of data.
2. **High memory requirement**: Every machine must store the parameters of the entire model.
3. **Long training time**: Every machine must process the complete model, so training is slow.

To solve these problems, the authors propose a new distributed training protocol, **ResIST (ResNet Independent Subnetwork Training)**. ResIST reduces the amount of communication and computation in each iteration by randomly decomposing the global ResNet into multiple shallow sub-networks (sub-ResNets) and training these sub-networks independently in a distributed manner. The steps are as follows:

- Randomly decompose the global ResNet into several shallow sub-networks.
- Train these sub-networks independently on multiple machines for several local iterations.
- Synchronize and aggregate the updates of these sub-networks into the global model.
- Repeat the process with newly generated sub-networks until convergence.

In this way, ResIST significantly reduces the communication, memory, and time requirements of each iteration while maintaining model performance. Experimental results show that ResIST is more communication-efficient than traditional distributed training methods and remains competitive in model performance.

### Main contributions

1. **ResIST**: A new distributed training protocol that reduces communication cost by decomposing a ResNet into multiple shallow sub-networks.
2. **Theoretical analysis**: The paper proves linear convergence of ResIST on a simple ResNet architecture and shows how convergence is influenced by hyperparameters (such as the over-parameterization parameter m, the number of worker nodes S, the number of local iterations, and the ResNet depth H).
3. **Extensive experimental verification**: The efficiency and accuracy of ResIST are verified through experiments on multiple image classification and object detection datasets (such as CIFAR10/100, ImageNet, and PascalVOC).
4. **Design choice optimization**: Ablation experiments determine the best design choices, including using a pre-activation ResNet, scaling the intermediate activations of the global network, sharing layers that are sensitive to pruning, and setting a minimum depth for the sub-networks.

In short, the paper uses ResIST to provide a more efficient distributed training method that addresses the communication, memory, and time bottlenecks of existing methods.
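
To make the decompose, train, and aggregate cycle described above more concrete, here is a minimal NumPy sketch of a single round. It is an illustration under stated assumptions, not the authors' implementation: the "ResNet" is a toy stack of linear residual blocks (`h + h @ W`), the blocks are split into disjoint groups so that aggregation reduces to copying each trained block back, and the names `forward`, `local_sgd`, and `resist_round` are hypothetical. The shared layers and activation scaling mentioned in the contributions are omitted.

```python
import numpy as np

def forward(blocks, x):
    """Run a toy linear residual stack: h <- h + h @ W for each block."""
    acts = [x]
    for w in blocks:
        acts.append(acts[-1] + acts[-1] @ w)
    return acts

def local_sgd(blocks, x, y, steps, lr):
    """Train one sub-network in isolation for a few local iterations."""
    blocks = [w.copy() for w in blocks]
    for _ in range(steps):
        acts = forward(blocks, x)
        # Gradient of 0.5 * mean squared error w.r.t. the final activation.
        grad_out = (acts[-1] - y) / len(x)
        for i in reversed(range(len(blocks))):
            grad_w = acts[i].T @ grad_out
            # Backpropagate through the residual block before updating it.
            grad_out = grad_out + grad_out @ blocks[i].T
            blocks[i] -= lr * grad_w
    return blocks

def resist_round(global_blocks, shards, rng, steps=5, lr=0.05):
    """One round: random decomposition, local training, aggregation."""
    # Assumption: a disjoint random partition of block indices, one group
    # per worker; indices are sorted to preserve the original layer order.
    perm = rng.permutation(len(global_blocks))
    assignment = [np.sort(part) for part in np.array_split(perm, len(shards))]
    new_blocks = [w.copy() for w in global_blocks]
    for idx, (x, y) in zip(assignment, shards):
        # Each worker only receives the blocks of its own sub-network.
        sub = [global_blocks[i] for i in idx]
        trained = local_sgd(sub, x, y, steps, lr)
        # With disjoint groups, aggregation is a copy back into the global model.
        for i, w in zip(idx, trained):
            new_blocks[i] = w
    return new_blocks, assignment

rng = np.random.default_rng(0)
dim, depth, workers = 4, 6, 3
global_blocks = [0.01 * rng.standard_normal((dim, dim)) for _ in range(depth)]
# Synthetic per-worker data shards (inputs and regression targets).
shards = [(rng.standard_normal((32, dim)), rng.standard_normal((32, dim)))
          for _ in range(workers)]
new_blocks, assignment = resist_round(global_blocks, shards, rng)
for worker, idx in enumerate(assignment):
    print(f"worker {worker} trained blocks {idx.tolist()} "
          f"({len(idx)}/{depth} of the residual blocks)")
```

In this toy setup each worker holds and communicates only `depth / workers` of the residual blocks per round, which mirrors the source of the communication and memory savings the summary describes. Calling `resist_round` repeatedly with the same `rng` draws a fresh random decomposition each round.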