Abstract: Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule incurs little bubble overhead and delivers high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness due to the interleaved training of different mini-batches across GPUs. To address both problems simultaneously, we propose an optimizer-dependent weight prediction strategy (a.k.a. PipeOptim) for asynchronous pipeline training. The key insight is to apply weight prediction before the forward pass so that each mini-batch computes its forward pass with consistent, staleness-free weights. Concretely, we first construct the weight prediction scheme from the update rule of the optimizer used to train the deep neural network. Then, throughout "1F1B" pipelined training, each mini-batch performs weight prediction ahead of its forward pass and uses the predicted weights for that pass. As a result, PipeOptim 1) inherits the advantage of the "1F1B" schedule and achieves high throughput, and 2) ensures effective parameter learning regardless of the optimizer used. To verify the effectiveness of our proposal, we conducted extensive experimental evaluations using eight deep-learning models spanning three machine-learning tasks: image classification, sentiment analysis, and machine translation. The experimental results demonstrate that PipeOptim outperforms popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim is available at https://github.com/guanleics/PipeOptim.
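
To make the idea of optimizer-dependent weight prediction more concrete, here is a minimal sketch for one possible case, momentum SGD. The abstract does not state the prediction formula, so everything below is an assumption for illustration: the function names `predict_weights` and `momentum_sgd_step`, the `staleness` parameter (the number of optimizer updates expected between a mini-batch's forward pass and its backward pass), and the choice to extrapolate by repeating the latest update direction. It should not be read as the PipeOptim implementation.

```python
def momentum_sgd_step(weights, velocity, grads, lr, momentum):
    """One momentum-SGD update: v <- momentum * v + g, then w <- w - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def predict_weights(weights, velocity, lr, staleness):
    """Extrapolate the weights `staleness` updates ahead.

    Assumption (hypothetical, not taken from the paper): each of the
    upcoming updates moves the weights by the same amount as the most
    recent one, i.e. by -lr * velocity.
    """
    return [w - lr * staleness * v for w, v in zip(weights, velocity)]


# Toy usage: a stage whose forward pass is 3 updates ahead of its backward pass
# runs the forward pass on the predicted weights instead of the current ones.
weights = [1.0, -2.0]
velocity = [0.5, -0.25]
predicted = predict_weights(weights, velocity, lr=0.1, staleness=3)
```

Under this assumption, a different optimizer would only change how the per-step update direction is obtained (for example from Adam's moment estimates); the extrapolation step itself would stay the same.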

PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications

vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

WidePipe: High-Throughput Deep Learning Inference System on a Cluster of Neural Processing Units

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

PipeOrgan: Efficient Inter-operation Pipelining with Flexible Spatial Organization and Interconnects

Versapipe: a versatile programming framework for pipelined computing on GPU.

PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction

PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

H2PIPE: High throughput CNN Inference on FPGAs with High-Bandwidth Memory

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

ElasticPipe

Pie: A Pipeline Energy-Efficient Accelerator for Inference Process in Deep Neural Networks