Abstract: As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.

What problem does this paper attempt to address?

Large deep neural network (DNN) models often exceed the memory of a single accelerator, so their parameters must be sharded across several accelerators, commonly with pipeline parallelism. The paper addresses the idle compute time that current pipeline-parallel implementations incur, a bottleneck the authors attribute to the automatic differentiation tools provided by ML frameworks. Its proposal, 2-Stage Backpropagation (2BP), divides the backpropagation step into two independent stages, which reduces idle time and improves training throughput.

2BP was tested on several model architectures and pipeline scheduling strategies and improved throughput in every case. For example, when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs, 2BP achieved a 1.70x throughput improvement over traditional methods. The paper also discusses data parallelism, pipeline parallelism, and tensor parallelism, and analyzes their advantages and disadvantages for large-scale models.

The paper further examines the cost of 2BP in memory: while 2BP reduces idle time, it also increases peak memory usage, and the size of this increase differs significantly between model architectures and pipeline scheduling strategies. When 2BP is scaled to more GPUs, whether the model size is held fixed or the global model size is varied, its performance gains may decrease as the number of GPUs increases, possibly due to increased communication demands.

In conclusion, the paper aims to improve the computational efficiency of large-scale DNN training by making pipeline-parallel training more efficient with 2BP. It also highlights the increased memory usage of this approach, which points to directions for future optimization.
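The split of the backward pass can be illustrated with a small sketch. The summary above does not spell out what the two stages are; the sketch assumes the first stage computes the gradient with respect to a layer's input (which the preceding pipeline stage is waiting for) and the second stage computes the gradient with respect to the layer's weights (which nothing upstream depends on, so it can be deferred). The names `LinearStage`, `backward_p1`, and `backward_p2` are hypothetical, and the code uses plain Python lists rather than any ML framework.

```python
# Illustrative sketch only: a bias-free linear layer y = W x whose backward
# pass is split into two separately callable stages. All names are assumptions.

def matvec(W, x):
    """Compute W x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matvec_t(W, g):
    """Compute W^T g without materialising the transpose."""
    return [sum(W[r][c] * g[r] for r in range(len(W))) for c in range(len(W[0]))]

class LinearStage:
    def __init__(self, W):
        self.W = W
        self.grad_W = [[0.0] * len(W[0]) for _ in W]
        self._pending = []  # (input, grad_output) pairs waiting for stage 2

    def forward(self, x):
        self._x = x
        return matvec(self.W, x)

    def backward_p1(self, grad_out):
        # Stage 1: gradient w.r.t. the input. The previous pipeline stage
        # needs this value, so it is computed and returned immediately.
        # The input and grad_out are kept alive for stage 2, which is one
        # plausible source of the higher peak memory noted above.
        self._pending.append((self._x, grad_out))
        return matvec_t(self.W, grad_out)

    def backward_p2(self):
        # Stage 2: gradient w.r.t. the weights (outer product of grad_out
        # and the input), accumulated whenever the device would be idle.
        for x, g in self._pending:
            for r, gr in enumerate(g):
                for c, xc in enumerate(x):
                    self.grad_W[r][c] += gr * xc
        self._pending.clear()

# Two chained stages standing in for two pipeline devices.
stage0 = LinearStage([[1.0, 2.0], [3.0, 4.0]])
stage1 = LinearStage([[0.5, -1.0]])

h = stage0.forward([1.0, 1.0])
y = stage1.forward(h)

# Stage 1 runs back through the whole pipeline first...
grad_h = stage1.backward_p1([1.0])
grad_x = stage0.backward_p1(grad_h)

# ...and the deferred weight gradients are filled in afterwards.
stage1.backward_p2()
stage0.backward_p2()
```

In this sketch `stage0` receives `grad_h` as soon as `stage1` finishes its first stage, instead of waiting for `stage1` to also compute its weight gradient. A framework's automatic differentiation typically computes both gradients in a single backward call, which is the coupling the abstract describes as a bottleneck.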

2BP: 2-Stage Backpropagation
