Abstract: The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves dynamically managing model execution by dividing models into individual layers and efficiently transferring these layers between GPU and CPU memory. Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds. This allows models that would otherwise exceed available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or specific model types, Superpipeline can be applied to large language models (LLMs), vision-language models (VLMs), and vision-based models. We tested Superpipeline's performance across various models and hardware setups. The method includes two key parameters that allow fine-tuning the balance between GPU memory use and processing speed. Importantly, Superpipeline does not require retraining or changing model parameters, ensuring that the original model's output remains unchanged. Superpipeline's simplicity and flexibility make it useful for researchers and professionals working with advanced AI models on limited hardware. It enables the use of larger models or bigger batch sizes on existing hardware, potentially speeding up innovation across many machine learning applications. This work marks an important step toward making advanced AI models more accessible and optimizing their deployment in resource-limited environments. The code for Superpipeline is available at https://github.com/abbasiReza/super-pipeline.
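The layer-by-layer idea in the abstract can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions, not the Superpipeline implementation: the function name `run_with_layer_window` and the parameters `k` (layers kept resident) and `k_prime` (layers swapped per transfer) are hypothetical stand-ins for the two tuning parameters the abstract mentions, and GPU residency is modeled with a queue of layer indices rather than real device transfers.

```python
from collections import deque
from typing import Callable, Sequence, Tuple

Layer = Callable[[float], float]


def run_with_layer_window(
    layers: Sequence[Layer], x: float, k: int = 2, k_prime: int = 1
) -> Tuple[float, int, int]:
    """Run `layers` in order while at most `k` are "resident" at once.

    Hypothetical parameters (assumptions, not taken from the paper):
      k        number of layers held in simulated GPU memory
      k_prime  number of layers evicted and loaded per swap
    Returns (output, peak number of resident layers, number of swaps).
    """
    if not 1 <= k_prime <= k:
        raise ValueError("need 1 <= k_prime <= k")
    n = len(layers)
    resident = deque(range(min(k, n)))  # indices of layers "on the GPU"
    next_to_load = len(resident)
    peak_resident = len(resident)
    swaps = 0
    done_since_swap = 0
    for i, layer in enumerate(layers):
        assert i in resident, "layer must be resident before it runs"
        x = layer(x)
        done_since_swap += 1
        if done_since_swap == k_prime and next_to_load < n:
            # Evict the k_prime oldest (already executed) layers ...
            for _ in range(k_prime):
                resident.popleft()
            # ... and bring in the next layers that are still off-device.
            load = min(k_prime, n - next_to_load)
            resident.extend(range(next_to_load, next_to_load + load))
            next_to_load += load
            peak_resident = max(peak_resident, len(resident))
            swaps += 1
            done_since_swap = 0
    return x, peak_resident, swaps


# Six toy "layers"; layer number a adds a to its input.
toy_layers = [lambda v, a=a: v + a for a in range(1, 7)]
all_resident = run_with_layer_window(toy_layers, 0.0, k=6, k_prime=1)
windowed = run_with_layer_window(toy_layers, 0.0, k=3, k_prime=1)
```

In this toy model the windowed run produces the same output as the run with every layer resident, while holding fewer layers at once and paying for it with extra swaps. That mirrors, in simplified form, the memory-versus-speed trade-off described above.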

Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe

AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning

Pipeline Parallelism with Controllable Memory

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

PipeMare: Asynchronous Pipeline Parallel DNN Training

Balancing Pipeline Parallelism with Vocabulary Parallelism

vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction

Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview

Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency

2BP: 2-Stage Backpropagation

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

Analyzing the Performance of Graph Neural Networks with Pipe Parallelism

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training