Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations

Yu-Hui Chen, Raman Sarokin, Juhyun Lee, Jiuqiang Tang, Chuo-Ling Chang, Andrei Kulik, Matthias Grundmann

2023-06-17

Abstract: The rapid development and application of foundation models have revolutionized the field of artificial intelligence. Large diffusion models have gained significant attention for their ability to generate photorealistic images and support various tasks. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. However, common large diffusion models have over 1 billion parameters and pose challenges due to restricted computational and memory resources on devices. We present a series of implementation optimizations for large diffusion models that achieve the fastest reported inference latency to date (under 12 seconds for Stable Diffusion 1.4 without int8 quantization on Samsung S23 Ultra for a 512x512 image with 20 iterations) on GPU-equipped mobile devices. These enhancements broaden the applicability of generative AI and improve the overall user experience across a wide range of devices.

Computer Vision and Pattern Recognition, Machine Learning, Image and Video Processing

What problem does this paper attempt to address?

The problem this paper addresses is how to deploy large-scale diffusion models (such as Stable Diffusion) on mobile devices efficiently, achieving fast inference despite limited on-device compute and memory. Specifically, the paper points out:

1. **Existing problems**:
   - Large-scale diffusion models (such as Stable Diffusion 1.4) have more than 1 billion parameters, which strains the limited compute and memory available on mobile devices.
   - Running these models directly on mobile devices leads to high latency, especially during the iterative denoising process, which consumes a large amount of memory and increases processing time.
   - Earlier attempts to deploy Stable Diffusion on mobile devices are usually limited to specific devices or chipsets, and they leave room for improvement in inference latency.
2. **Objectives**:
   - Provide a series of optimization techniques that accelerate inference for large-scale diffusion models on GPU-equipped mobile devices.
   - Achieve the fastest reported inference latency (for example, under 12 seconds to generate a 512×512 image with 20 iterations on a Samsung S23 Ultra, without INT8 quantization).
   - Broaden the applicability of generative AI and improve the user experience across a range of devices.

Through these optimizations, the paper aims to let large-scale diffusion models run efficiently on a wider range of mobile devices, thereby reducing server costs, enabling offline use, and improving user privacy.
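To make the memory and latency constraints above concrete, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the function names are hypothetical, the parameter count uses the abstract's lower bound of 1 billion, and the byte widths per data type are the standard ones. The per-step figure simply divides the reported 12-second bound by 20 iterations, so it overstates the time available to each denoising step, since that total presumably also covers stages outside the denoising loop (an assumption about the pipeline, not a reported number).

```python
# Back-of-the-envelope sketch; names and structure are illustrative assumptions.

# Standard storage width of one parameter for each data type, in bytes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone (no activations), in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def mean_step_budget_s(total_latency_s: float, num_steps: int) -> float:
    """Upper bound on the average time available per denoising iteration."""
    return total_latency_s / num_steps


if __name__ == "__main__":
    num_params = 1_000_000_000  # lower bound quoted in the abstract
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
    print(f"per-step budget: {mean_step_budget_s(12.0, 20):.2f} s")
```

Even at 16-bit precision, the weights alone occupy a large share of the memory a phone can give to a single app, which is why the summary treats memory as a first-order constraint alongside latency.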

Squeezing Large-Scale Diffusion Models for Mobile

MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

On-Device Neural Net Inference with Mobile GPUs

Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

Accelerated Image-Aware Generative Diffusion Modeling

SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

Accelerating Parallel Sampling of Diffusion Models

Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation

Choose Your Diffusion: Efficient and flexible ways to accelerate the diffusion model in fast high energy physics simulation

Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Accelerating Image Generation with Sub-path Linear Approximation Model

Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration