2BP: 2-Stage Backpropagation

Christopher Rae, Joseph K. L. Lee, James Richings
2024-05-28
Abstract: As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses idle compute time in pipeline-parallel training of large deep neural networks (DNNs). As models grow in size and complexity, they exceed the memory of a single accelerator and must be sharded across several. Pipeline parallelism is a common way to do this, but current implementations are bottlenecked by the automatic differentiation tools of ML frameworks, which leaves accelerators idle for long periods. The paper proposes 2-Stage Backpropagation (2BP), which splits the backpropagation step into two separate stages to reduce this idle time and raise training throughput.
2BP was tested on several model architectures and pipeline schedules and improved throughput in every case. For example, when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs, 2BP achieved a 1.70x throughput increase over traditional methods. The paper also reviews data parallelism, pipeline parallelism, and tensor parallelism, and analyzes their advantages and disadvantages for large-scale models.
The paper further examines the memory cost of 2BP: while it reduces idle time, it also increases peak memory usage, and the size of the increase differs significantly across model architectures and pipeline schedules. It also studies how 2BP scales to more GPUs and observes that, whether the model size is held fixed or the global model size varies, the performance gain of 2BP may shrink as the number of GPUs increases, possibly because of increased communication demands.
In conclusion, the paper aims to make pipeline-parallel training of large DNNs more compute-efficient through 2BP, and it identifies the higher memory usage of the approach as a direction for future optimization.
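The idea of splitting the backward pass can be illustrated with a minimal sketch. It assumes that the first stage computes the gradient with respect to a layer's input (which the previous pipeline stage is waiting for) and that the second stage computes the gradient with respect to the layer's weights (which can be deferred); the text above does not spell out this split, so treat it as an assumption. The `Linear` class, `backward_p1`, `backward_p2`, and the `pending` buffer are hypothetical names chosen for illustration, written in plain Python rather than any framework's API.

```python
from collections import deque


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


class Linear:
    """Toy bias-free linear layer y = x @ W with a two-stage backward pass."""

    def __init__(self, weight):
        self.weight = weight
        self.weight_grad = None
        self.inputs = deque()  # inputs saved by forward, oldest first
        self.pending = []      # (input, grad_output) pairs awaiting stage 2

    def forward(self, x):
        self.inputs.append(x)
        return matmul(x, self.weight)

    def backward_p1(self, grad_out):
        # Stage 1 (assumed): gradient w.r.t. the input, grad_out @ W^T.
        # The upstream pipeline stage needs this value immediately.
        # Assumes micro-batches are processed backward in forward order.
        x = self.inputs.popleft()
        self.pending.append((x, grad_out))  # held in memory until stage 2
        return matmul(grad_out, transpose(self.weight))

    def backward_p2(self):
        # Stage 2 (assumed): gradient w.r.t. the weights, x^T @ grad_out,
        # accumulated over all micro-batches deferred so far.
        for x, grad_out in self.pending:
            gw = matmul(transpose(x), grad_out)
            if self.weight_grad is None:
                self.weight_grad = gw
            else:
                self.weight_grad = [[a + b for a, b in zip(r1, r2)]
                                    for r1, r2 in zip(self.weight_grad, gw)]
        self.pending.clear()


layer = Linear([[1.0, 2.0], [3.0, 4.0]])
y = layer.forward([[1.0, 2.0]])
grad_in = layer.backward_p1([[1.0, 1.0]])  # can be sent upstream right away
layer.backward_p2()                        # run later, e.g. in an idle slot
print(y, grad_in, layer.weight_grad)
```

In this sketch the `pending` buffer keeps inputs and output gradients alive until `backward_p2` runs, which is one plausible way to see why deferring the second stage would raise peak memory while freeing the first stage to unblock other accelerators sooner.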