Abstract: We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to improve the efficiency of distributed ResNet (Residual Network) training while reducing communication, memory, and time costs**. Existing distributed training methods, such as data-parallel training and local SGD, accelerate training to some extent but still have the following problems:

1. **High communication cost**: Frequent parameter synchronization generates a large volume of communication.
2. **High memory requirement**: Every machine must store the parameters of the entire model.
3. **Long training time**: Every machine must process the complete model, so each iteration is slow.

To address these problems, the authors propose a new distributed training protocol, **ResIST (ResNet Independent Subnetwork Training)**. The method reduces the communication and computation of each iteration by randomly decomposing the global ResNet into multiple shallow sub-networks (sub-ResNets) and training them independently in a distributed manner. The steps are as follows:

- Randomly decompose the global ResNet into several shallow sub-networks.
- Independently train these sub-networks on multiple machines for several local iterations.
- Synchronize and aggregate the updates of these sub-networks into the global model.
- Repeat the above process, generating new random sub-networks each round, until convergence.

In this way, ResIST reduces the per-iteration communication, memory, and time requirements to a fraction of those of full-model training while keeping model performance competitive. According to the experimental results, ResIST is more communication-efficient than data-parallel training and local SGD and remains competitive with them in model performance.

### Main contributions

1. **Propose ResIST**: A new distributed training protocol that reduces communication cost by decomposing a ResNet into multiple shallow sub-networks.
2. **Theoretical analysis**: Prove the linear convergence of ResIST on a simple ResNet architecture and show how convergence is affected by hyperparameters (such as the over-parameterization parameter m, the number of worker nodes S, the number of local iterations, and the ResNet depth H).
3. **Extensive experimental verification**: Verify the efficiency and accuracy of ResIST through experiments on multiple image classification and object detection datasets (such as CIFAR10/100, ImageNet, and PascalVOC).
4. **Design choice optimization**: Determine the best design choices through ablation experiments, including using pre-activation ResNets, scaling the intermediate activations of the global network, sharing layers that are sensitive to pruning, and setting a minimum depth for sub-networks.

In short, this paper aims to provide a more efficient distributed training method, ResIST, that addresses the communication, memory, and time bottlenecks of existing methods.
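The decompose, train locally, aggregate loop described above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: each residual block is reduced to a single scalar parameter, the loss is a per-block quadratic, and the function names (`decompose`, `local_train`, `aggregate`, `resist_round`) and the rule of averaging blocks held by several workers are hypothetical choices made for illustration.

```python
import random


def decompose(num_blocks, num_workers, shared=(), rng=random):
    """Randomly assign each non-shared block to exactly one worker.

    Blocks in `shared` go to every worker (an assumed stand-in for layers
    that are too sensitive to drop from any sub-network).
    """
    assignment = [set(shared) for _ in range(num_workers)]
    free = [b for b in range(num_blocks) if b not in shared]
    rng.shuffle(free)
    for i, block in enumerate(free):
        assignment[i % num_workers].add(block)
    return assignment


def local_train(params, blocks, target, lr, local_iters):
    """Toy local training on one worker: gradient descent on
    0.5 * (p - target)^2 for each block the worker holds."""
    local = {b: params[b] for b in blocks}  # worker only receives its blocks
    for _ in range(local_iters):
        for b in local:
            local[b] -= lr * (local[b] - target[b])
    return local


def aggregate(params, worker_results):
    """Copy updated blocks back into the global model, averaging any block
    that was trained by more than one worker (assumed aggregation rule)."""
    sums, counts = {}, {}
    for result in worker_results:
        for b, value in result.items():
            sums[b] = sums.get(b, 0.0) + value
            counts[b] = counts.get(b, 0) + 1
    new_params = list(params)
    for b in sums:
        new_params[b] = sums[b] / counts[b]
    return new_params


def resist_round(params, target, num_workers, shared, lr, local_iters, rng):
    """One synchronization round: new random sub-networks, local training,
    then aggregation into the global model."""
    assignment = decompose(len(params), num_workers, shared, rng)
    results = [
        local_train(params, blocks, target, lr, local_iters)
        for blocks in assignment
    ]
    return aggregate(params, results), assignment


rng = random.Random(0)
params = [0.0] * 8   # global "ResNet" with 8 toy blocks
target = [1.0] * 8   # toy optimum for each block
for _ in range(5):
    params, assignment = resist_round(
        params, target, num_workers=4, shared=(0,),
        lr=0.5, local_iters=3, rng=rng,
    )
print([len(blocks) for blocks in assignment])  # blocks held per worker
print(max(abs(p - t) for p, t in zip(params, target)))  # remaining error
```

In this sketch each worker receives only two or three of the eight blocks per round, which is the source of the per-worker communication and memory savings the summary describes. A realistic version would replace the scalar blocks with actual residual layers and a real optimizer.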

ResIST: Layer-Wise Decomposition of ResNets for Distributed Training

ROG: A High Performance and Robust Distributed Training System for Robotic IoT

DyRep: Bootstrapping Training with Dynamic Re-parameterization

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

Decentralized Proactive Model Offloading and Resource Allocation for Split and Federated Learning

Block-wise Training of Residual Networks via the Minimizing Movement Scheme

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration

Layer-Parallel Training of Residual Networks with Auxiliary-Variable Networks

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

Secure Distributed Training at Scale

EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform

A Practical Layer-Parallel Training Algorithm for Residual Networks

Scalable Neural Network Training over Distributed Graphs

A Stage-Level Network Parallelization Method Based on Depth Decomposition

Interlocking Backpropagation: Improving depthwise model-parallelism

Distributed Newton Methods for Deep Neural Networks

RMNet: Equivalently Removing Residual Connection from Networks

RRR-Net: Reusing, Reducing, and Recycling a Deep Backbone Network

RNEP: Random Node Entropy Pairing for Efficient Decentralized Training with Non-IID Local Data

In-Network Aggregation with Transport Transparency for Distributed Training

RTP: Rethinking Tensor Parallelism with Memory Deduplication