PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, Gregory R. Ganger, Yida Wang

2024-09-24

Abstract: Training Deep Neural Networks (DNNs) with billions of parameters generally involves pipeline-parallel (PP) execution. Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, because of idle GPU time caused by pipeline bubbles, which often account for 15-30% of the training job's GPU allocation and can exceed 60%. To improve the GPU utilization of PP model training, this paper describes PipeFill, which fills pipeline bubbles with the execution of other pending jobs. By using bubble GPU time, PipeFill reduces the GPU utilization sacrificed when scaling up large-model training. To context-switch between fill jobs and the main training job with minimal overhead to the main job, and to maximize fill job efficiency, PipeFill carefully fits fill job work to measured bubble durations and GPU memory availability, introduces explicit pipeline-bubble instructions, and orchestrates the placement and execution of fill jobs in pipeline bubbles. Experiments show that PipeFill can increase overall utilization by up to 63% for GPUs used in large-scale LLM training, with <2% slowdown of the training job, and by 5-15% even for low-scale LLM training. For large-scale LLM training on 8K GPUs, the 63% increase translates to up to 2.6K additional GPUs' worth of work completed.
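The abstract's idea of fitting fill job work to measured bubble durations and GPU memory availability can be illustrated with a small sketch. This is not PipeFill's actual algorithm: the `Bubble` and `WorkUnit` types, the `plan_fill` function, the `switch_overhead_ms` parameter, and the greedy first-fit policy are all hypothetical names and choices made here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bubble:
    """A measured idle window on one GPU (hypothetical type)."""
    duration_ms: float   # how long the GPU is idle
    free_mem_gb: float   # memory the main job leaves unused during the bubble


@dataclass(frozen=True)
class WorkUnit:
    """One schedulable chunk of a fill job (hypothetical type)."""
    name: str
    est_ms: float        # estimated run time of the chunk
    mem_gb: float        # peak memory the chunk needs


def plan_fill(bubble, queue, switch_overhead_ms=0.0):
    """Greedy first-fit: pick units, in queue order, that fit the bubble.

    Units run one after another, so time is a cumulative budget while
    memory is a per-unit peak constraint. This is an assumed policy,
    not the one described in the paper.
    """
    budget_ms = bubble.duration_ms - switch_overhead_ms
    plan = []
    for unit in queue:
        if unit.mem_gb > bubble.free_mem_gb:
            continue  # would not fit next to the main job's state
        if unit.est_ms <= budget_ms:
            plan.append(unit)
            budget_ms -= unit.est_ms
    return plan


bubble = Bubble(duration_ms=100.0, free_mem_gb=10.0)
queue = [
    WorkUnit("a", est_ms=40.0, mem_gb=4.0),
    WorkUnit("b", est_ms=50.0, mem_gb=12.0),  # too much memory
    WorkUnit("c", est_ms=30.0, mem_gb=8.0),
    WorkUnit("d", est_ms=40.0, mem_gb=2.0),   # no time left for it
]
chosen = plan_fill(bubble, queue, switch_overhead_ms=5.0)
print([u.name for u in chosen])  # ['a', 'c']
```

Reserving a context-switch margin before packing keeps the fill work from running past the end of the bubble, which is what would slow down the main training job.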

Distributed, Parallel, and Cluster Computing; Machine Learning

What problem does this paper attempt to address?

This paper addresses the low GPU utilization that pipeline-parallel (PP) execution causes when training large language models (LLMs). As models grow and training spans more compute nodes, data dependencies and synchronization between pipeline stages leave each GPU idle for part of every iteration; these idle periods are called pipeline bubbles. The problem worsens as training scales up, because the fraction of time spent in bubbles grows with the degree of parallelism. To solve this, the paper proposes PipeFill, whose core idea is to fill pipeline bubbles with work from other, independent pending jobs, raising overall GPU utilization without significantly affecting the main training job. Experimental results show that PipeFill can increase GPU utilization by up to 63% in large-scale LLM training while slowing the training job by less than 2%.
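To see why the bubble share grows with scale, a commonly used idealized estimate can help. The sketch below assumes a synchronous schedule such as GPipe or 1F1B in which every microbatch takes the same time on every stage; under that assumption the bubble fraction is (p - 1) / (m + p - 1) for p pipeline stages and m microbatches per iteration. The formula and the function name `bubble_fraction` are not taken from the paper, and the stage and microbatch counts are made-up inputs.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idealized share of an iteration that each GPU spends idle.

    Assumes a synchronous pipeline schedule with equal per-stage,
    per-microbatch compute time (an assumption, not a measurement).
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stage and microbatch counts must be positive")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# With a fixed global batch, adding data-parallel replicas leaves fewer
# microbatches per pipeline, so the bubble share rises.
for m in (64, 16, 8):
    print(f"stages=16 microbatches={m}: {bubble_fraction(16, m):.1%}")
# stages=16 microbatches=64: 19.0%
# stages=16 microbatches=16: 48.4%
# stages=16 microbatches=8: 65.2%
```

Under these assumptions the estimate spans roughly the same range the abstract reports, from bubbles in the 15-30% band to bubbles above 60% at larger scale.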
