Abstract: The rapid growth of machine learning models, especially in natural language processing and computer vision, has made them difficult to run on hardware with limited resources. This paper introduces Superpipeline, a framework that optimizes the execution of large AI models on constrained hardware during both training and inference. The approach manages model execution dynamically: it divides a model into individual layers and transfers those layers between GPU and CPU memory as they are needed. In our experiments, Superpipeline reduces GPU memory usage by up to 60% while maintaining model accuracy and acceptable processing speeds, which allows models that would otherwise exceed available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or on specific model types, Superpipeline can be applied to large language models (LLMs), vision-language models (VLMs), and vision-based models. We tested Superpipeline's performance across various models and hardware setups. The method exposes two key parameters for tuning the balance between GPU memory use and processing speed. Importantly, Superpipeline does not require retraining or changing model parameters, so the original model's output remains unchanged. Its simplicity and flexibility make it useful for researchers and practitioners working with advanced AI models on limited hardware. It enables larger models or bigger batch sizes on existing hardware, potentially speeding up innovation across many machine learning applications. This work is a step toward making advanced AI models more accessible and optimizing their deployment in resource-limited environments. The code for Superpipeline is available at https://github.com/abbasiReza/super-pipeline.
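The layer-by-layer transfer described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the Superpipeline implementation: GPU residency is modeled as a set of layer indices rather than real device memory, and the function name `run_offloaded` and the parameters `resident` (how many layers may sit on the GPU at once) and `step` (how many finished layers are swapped out together) are hypothetical stand-ins for the two tuning parameters, which the abstract does not name.

```python
from typing import Callable, Sequence, Tuple

Layer = Callable[[int], int]


def run_offloaded(
    layers: Sequence[Layer], x: int, resident: int, step: int
) -> Tuple[int, int, int]:
    """Run `layers` in order with at most `resident` layers on a simulated GPU.

    `resident` and `step` are hypothetical names, not taken from the paper.
    Returns (output, peak number of resident layers, number of load batches).
    """
    if resident < 1 or not 1 <= step <= resident:
        raise ValueError("require resident >= 1 and 1 <= step <= resident")

    on_gpu = set()      # indices of layers currently "on the GPU"
    next_to_load = 0    # next layer index still waiting in CPU memory
    finished = []       # layers already executed but not yet evicted
    peak = 0
    load_batches = 0

    def refill() -> None:
        """Move layers from CPU to GPU until the residency limit is reached."""
        nonlocal next_to_load, peak, load_batches
        loaded_any = False
        while len(on_gpu) < resident and next_to_load < len(layers):
            on_gpu.add(next_to_load)
            next_to_load += 1
            loaded_any = True
        if loaded_any:
            load_batches += 1
        peak = max(peak, len(on_gpu))

    refill()  # initial load of the first `resident` layers
    for i, layer in enumerate(layers):
        if i not in on_gpu:
            raise RuntimeError(f"layer {i} is not resident")
        x = layer(x)  # the layer itself is unchanged, so the output is too
        finished.append(i)
        if len(finished) == step:
            # Evict the finished layers back to CPU, then load the next ones.
            on_gpu.difference_update(finished)
            finished.clear()
            refill()
    return x, peak, load_batches


# Eight toy "layers"; each is a plain function of its input.
layers = [lambda v, k=k: v * 2 + k for k in range(8)]
out, peak, swaps = run_offloaded(layers, 1, resident=3, step=2)
print(out, peak, swaps)
```

In this toy model, a smaller `resident` lowers the peak number of layers held on the GPU, while a smaller `step` triggers more frequent load batches. That mirrors the memory-versus-speed trade-off the abstract describes, and the result matches a plain sequential pass because no layer is modified.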

Pipeline-based Optimization Method for Large-Scale End-to-End Inference

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

Strategies for Optimizing End-to-End Artificial Intelligence Pipelines on Intel Xeon Processors

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Inference Performance Optimization for Large Language Models on CPUs

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Deep Learning Compiler Load Balancing Optimization Method for Model Training

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Inference Optimization of Foundation Models on AI Accelerators

A Data-Centric Optimization Framework for Machine Learning

Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence

Accelerating End-to-End Deep Learning Workflow With Codesign of Data Preprocessing and Scheduling

Characterizing the I/O Pipeline in the Deployment of CNNs on Commercial Accelerators

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

Accelerating DNN Inference with Heterogeneous Multi-DPU Engines

Inference Acceleration for Large Language Models on CPUs

Hardware Accelerated Optimization of Deep Learning Model on Artificial Intelligence Chip