Abstract: DNN training is time-consuming and requires efficient multi-accelerator parallelization, where a single training iteration is split over the available accelerators. Current approaches often parallelize training using intra-batch parallelization. Combining inter-batch and intra-batch pipeline parallelism is common to further improve training throughput. In this article, we develop a system, called TiMePReSt, that combines them in a novel way that better overlaps computation and communication and limits the amount of communication. Traditional pipeline-parallel training of DNNs follows the same working principle as sequential (conventional) training by maintaining consistent weight versions across the forward and backward passes of a mini-batch. As a result, it suffers from a high GPU memory footprint during training. In this paper, an experimental study demonstrates that compromising weight consistency does not decrease the prediction capability of a DNN trained in parallel. Moreover, TiMePReSt overcomes the GPU memory overhead and achieves zero weight staleness. State-of-the-art techniques are often costly in terms of training time. To address this issue, TiMePReSt introduces a variant of intra-batch parallelism that parallelizes the forward pass of each mini-batch by decomposing it into smaller micro-batches. A novel synchronization method between forward and backward passes reduces training time in TiMePReSt. The occurrence of the multiple sequence problem and its relation to version difference have been observed in TiMePReSt. This paper presents a mathematical relationship between the number of micro-batches and the number of worker machines, highlighting the variation in version difference. A mathematical expression has been developed to calculate version differences for various combinations of these two without creating diagrams for all combinations.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address two major challenges in deep neural network (DNN) training:

1. **Staleness of Weights**:
   - In traditional pipeline parallel training, during the forward propagation of one mini-batch, another mini-batch might update the weights, causing the previous version of the weights to become stale. This staleness can affect the model's predictive ability.
   - To overcome this issue, TiMePReSt relaxes consistency requirements by allowing different versions of weights to be used in the forward and backward propagation of the same mini-batch, although vertical sync is still used during forward or backward propagation to maintain version consistency.
2. **Huge Training Time**:
   - Training deep neural networks is very time-consuming, especially in the case of multi-accelerator parallelization. Existing pipeline parallel techniques are often inefficient in terms of training time.
   - To reduce training time, TiMePReSt introduces a new intra-batch parallelization method, breaking each mini-batch into smaller micro-batches, thereby limiting the total computation and communication time.

### Solution Overview

- **Model Architecture**:
  - The core architecture of TiMePReSt is a pipeline parallel mechanism where the layers of the DNN are distributed across multiple accelerators (such as GPUs), with each group of consecutive layers assigned to one accelerator.
  - A cluster of two machines is used, each equipped with one GPU, to balance the GPU memory consumption of all member machines.
  - After forward propagation ends, the last machine calculates the prediction error or loss, and backward propagation starts from the same machine, with each machine calculating the gradient of the loss with respect to the weight parameters and passing these gradients to the previous machine in the sequence.
  - To prevent infinite waiting and deadlock between forward and backward propagation, an nForward 1 Backward (nF1B) scheduling strategy is introduced, which is a variant of the 1F1B scheduling mechanism used in PipeDream.
- **Intra-Batch Parallelization**:
  - Each mini-batch is divided into N smaller micro-batches, with micro-batches becoming the basic data processing unit for the entire pipeline training.
  - In this way, TiMePReSt can more effectively utilize computational resources, reducing training time.

### Main Contributions

- **Reducing Weight Staleness**:
  - By allowing different versions of weights to be used in forward and backward propagation, TiMePReSt can reduce GPU memory overhead without affecting predictive ability, despite sacrificing sequential stability.
- **Improving Time Efficiency**:
  - Through intra-batch parallelization techniques, TiMePReSt can significantly reduce training time, with experiments showing that it can complete more training epochs in the same amount of time, thereby reaching a specific accuracy faster.
- **Mathematical Relationships**:
  - The paper also proposes mathematical relationships between the number of worker machines and the number of micro-batches, as well as mathematical expressions for version differences, allowing the calculation of version differences under different combinations without plotting all combination graphs.

In summary, TiMePReSt effectively addresses the issues of weight staleness and long training times in DNN training through innovative pipeline parallel techniques and intra-batch parallelization methods.
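The two structural ideas in the summary above, assigning groups of consecutive layers to workers and dividing each mini-batch into N micro-batches, can be illustrated with a minimal Python sketch. It is not TiMePReSt's implementation: the function names (`contiguous_chunks`, `partition_layers`, `split_into_micro_batches`) are hypothetical, the layer names and sample indices are made up, and the split balances only the number of items per chunk, whereas a real system would presumably balance by measured memory or compute cost.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def contiguous_chunks(items: Sequence[T], k: int) -> List[List[T]]:
    """Split items into k contiguous chunks whose sizes differ by at most one."""
    if k < 1 or k > len(items):
        raise ValueError("need 1 <= k <= len(items)")
    base, extra = divmod(len(items), k)
    chunks: List[List[T]] = []
    start = 0
    for i in range(k):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def partition_layers(layers: Sequence[T], num_workers: int) -> List[List[T]]:
    """Assign one group of consecutive layers to each worker (pipeline stage).

    Assumption: balancing by layer count only, purely for illustration.
    """
    return contiguous_chunks(layers, num_workers)


def split_into_micro_batches(mini_batch: Sequence[T], n: int) -> List[List[T]]:
    """Divide one mini-batch into n micro-batches that keep sample order."""
    return contiguous_chunks(mini_batch, n)


# Hypothetical five-layer model placed on two workers.
layers = ["conv1", "conv2", "conv3", "fc1", "fc2"]
stages = partition_layers(layers, num_workers=2)
print(stages)  # [['conv1', 'conv2', 'conv3'], ['fc1', 'fc2']]

# Hypothetical mini-batch of 8 sample indices split into N = 4 micro-batches.
micro_batches = split_into_micro_batches(list(range(8)), n=4)
print(micro_batches)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each stage here corresponds to one worker in the pipeline, and each micro-batch would flow through the stages in order during the forward pass. The sketch deliberately leaves out the nF1B schedule and the weight-version handling, since the summary does not specify them in enough detail to reproduce.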

TiMePReSt: Time and Memory Efficient Pipeline Parallel DNN Training with Removed Staleness

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Optimizing DNN Training with Pipeline Model Parallelism for Enhanced Performance in Embedded Systems

PipeMare: Asynchronous Pipeline Parallel DNN Training

PaSE: Parallelization Strategies for Efficient DNN Training

A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs

RTP: Rethinking Tensor Parallelism with Memory Deduplication

TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training Via Tensor Splitting

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

Towards accelerating model parallelism in distributed deep learning systems

ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

2BP: 2-Stage Backpropagation

Parallel Training of Pre-Trained Models Via Chunk-Based Dynamic Memory Management