Abstract: Large-scale language models have become increasingly challenging and expensive to train. Among various methods addressing this issue, Pipeline Parallelism has been widely employed to accommodate massive model weights within limited GPU memory. This paper introduces Hanayo, a wave-like pipeline parallelism strategy that boasts a concise structure and practical applicability, alongside a high-performance pipeline execution runtime to tackle the challenges of pipeline strategy implementation. Hanayo mitigates the issues of pipeline bubbles and excessive memory consumption prevalent in existing schemes, without resorting to model duplicates as in Chimera. Our evaluation, conducted on four distinct computing clusters and involving both GPT-like and BERT-like architectures with up to 32 GPUs, demonstrates up to a 30.4% increase in throughput compared to the state-of-the-art approach.

What problem does this paper attempt to address?

This paper addresses four key challenges in large language model training:

1. **Memory wall**: As parameter counts grow sharply, model parameters far exceed the storage capacity of a single accelerator.
2. **Scaling wall**: Training large models requires thousands of accelerators, which leads to complex parallel patterns and heavy communication overhead that becomes a bottleneck for scaling.
3. **Computational wall**: Large models and large-scale datasets place extremely high demands on computing power.
4. **Development wall**: Complex parallel strategies and manual control of the communication process make training and developing large models extremely difficult.

To meet these challenges, the paper introduces **Hanayo**, a unified framework based on a wave-like pipeline parallelism strategy. Its main contributions are:

1. **Low bubble ratio and high performance**: Through its wave-like pipeline scheme, Hanayo achieves a low bubble ratio and high throughput, and performance improves further as the number of waves increases.
2. **Unified framework**: Hanayo proposes a unified pipeline-parallel framework and derives a unified performance model for pipeline parallelism through theoretical analysis.
3. **Decoupled runtime system**: The runtime system is decoupled from any specific pipeline-parallel algorithm. It supports almost all pipeline-parallel algorithms by expressing them as action lists, and it optimizes performance through features such as asynchronous communication.
4. **Experimental verification**: The paper benchmarks mainstream GPT-style and BERT-style models on four different computing clusters.

The experimental results show that Hanayo improves throughput by up to 30.4% compared to Chimera, the current state-of-the-art pipeline-parallel implementation. In conclusion, through its wave-like pipeline-parallel strategy and unified framework design, Hanayo addresses the memory, scaling, computational, and development challenges in large-model training and improves training efficiency.
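To make the "action list" idea in the decoupled-runtime contribution more concrete, here is a minimal sketch in Python. It is not Hanayo's implementation: the `Action` type, the action names (`recv_act`, `forward`, `send_act`, `recv_grad`, `backward`, `send_grad`), and the simple GPipe-style generator are all assumptions made for illustration. The point is only that a schedule can be compiled into a per-stage list of actions, and a generic executor can run that list without knowing which pipeline algorithm produced it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Action:
    kind: str          # hypothetical action name, e.g. "forward" or "send_act"
    micro_batch: int   # index of the micro-batch the action applies to


def gpipe_actions(stage: int, num_stages: int, num_micro_batches: int) -> List[Action]:
    """Build the action list for one stage of a simple GPipe-style schedule:
    all forward passes first, then all backward passes in reverse order.
    This is a stand-in schedule, not Hanayo's wave-like one."""
    first = stage == 0
    last = stage == num_stages - 1
    actions: List[Action] = []
    for mb in range(num_micro_batches):
        if not first:
            actions.append(Action("recv_act", mb))   # activations from previous stage
        actions.append(Action("forward", mb))
        if not last:
            actions.append(Action("send_act", mb))   # activations to next stage
    for mb in reversed(range(num_micro_batches)):
        if not last:
            actions.append(Action("recv_grad", mb))  # gradients from next stage
        actions.append(Action("backward", mb))
        if not first:
            actions.append(Action("send_grad", mb))  # gradients to previous stage
    return actions


def run(actions: List[Action], handlers: Dict[str, Callable[[int], None]]) -> List[str]:
    """Generic executor: dispatches each action to a handler and returns a log.
    It contains no schedule-specific logic."""
    log: List[str] = []
    for action in actions:
        handlers[action.kind](action.micro_batch)
        log.append(f"{action.kind}:{action.micro_batch}")
    return log


# Stub handlers; a real runtime would launch compute and communication here.
kinds = ("recv_act", "forward", "send_act", "recv_grad", "backward", "send_grad")
handlers = {kind: (lambda mb: None) for kind in kinds}

# Middle stage of a 3-stage pipeline with 2 micro-batches.
log = run(gpipe_actions(stage=1, num_stages=3, num_micro_batches=2), handlers)
print(log)
```

Under this design, supporting a different pipeline algorithm would mean writing a different generator function while leaving `run` untouched, which is the kind of decoupling the summary describes.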

Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Balancing Pipeline Parallelism with Vocabulary Parallelism

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

PipeMare: Asynchronous Pipeline Parallel DNN Training

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models

Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview

Automatic Graph Partitioning for Very Large-scale Deep Learning

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

Efficient Large-Scale Language Model Training on GPU Clusters

Improving Large Models with Small models: Lower Costs and Better Performance

An Efficient 2D Method for Training Super-Large Deep Learning Models

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

Accelerating Large Language Model Training with Hybrid GPU-based Compression