Abstract: Diffusion models (DMs) have emerged as a powerful class of generative models with record-breaking performance in image synthesis [1]. Generation starts from a noisy image sampled from pure Gaussian noise, which a DM must denoise iteratively to ensure generative quality. For DMs, quantizing activations to integers (INT) degrades image quality because activation distributions change and quantization errors accumulate across iterations. A GPU (Nvidia A100) requires 2560 ms and 250 W to generate a $256 \times 256$ image through 50 iterations of a floating-point (FP) DM. The images denoised in two adjacent iterations are visually similar, with very small differences between pixels at the same position. As a result, for two adjacent DM iterations, most input differences within the same layer ($\Delta IN$) are consistently clustered within a narrow range, indicating that most $\Delta IN$ values can be quantized to INT data. The remaining $\Delta IN$ values are relatively large, and their distributions vary across DM iterations. To ensure generative quality, a complete $\Delta IN$ tensor is divided into a dense INT tensor (INT-$\Delta IN$) and a sparse FP tensor (FP-$\Delta IN$). Compute-in-memory (CIM) has shown high throughput and energy efficiency on INT multiply-and-accumulate (MAC) operations, demonstrating its potential to process $\Delta IN$ efficiently. However, prior CIM chips face three challenges in speeding up on-device image generation to seconds with low power consumption [2–7]. First, conventional CIM chips perform MACs with bit-serial inputs, leading to significant runtime. A recent CIM chip incorporates an additional adder tree to handle one more input bit, albeit at the cost of 85.4% more power and 82.5% more area [3]. Second, CIM chips cannot process FP data as fast as INT data: they either require repeated reads/writes to handle high-precision mantissas [4] or face lengthy alignment-cycle latencies [5]. Third, previous FP CIMs do not support identifying and utilizing stored sparse data, leading to redundant computations.
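
The split of the layer-input difference into a dense INT tensor and a sparse FP tensor can be illustrated with a small software sketch. This is not the chip's datapath. The function names (`split_delta`, `delta_matvec`), the fixed clipping range `int_range`, the INT8 width, and the reuse of the previous iteration's layer output through linearity are all assumptions made for illustration. The abstract does not specify how the narrow range is chosen or how results are recombined.

```python
def split_delta(x_prev, x_cur, int_range=0.5, bits=8):
    """Split the input difference (x_cur - x_prev) into a dense INT part
    and a sparse FP part. int_range and bits are assumed, not from the paper."""
    qmax = 2 ** (bits - 1) - 1           # 127 for signed INT8
    scale = int_range / qmax             # step size of the INT grid
    int_part, fp_part = [], {}
    for i, (p, c) in enumerate(zip(x_prev, x_cur)):
        d = c - p
        if abs(d) <= int_range:
            int_part.append(round(d / scale))   # small difference: quantize
        else:
            int_part.append(0)                  # outlier: zero in the dense tensor
            fp_part[i] = d                      # and kept in FP, stored sparsely
    return int_part, fp_part, scale


def delta_matvec(weights, y_prev, int_part, fp_part, scale):
    """Update a linear layer output using W @ x_cur = W @ x_prev + W @ delta.
    The delta term is evaluated as a dense INT MAC plus a sparse FP MAC."""
    y = []
    for row, y0 in zip(weights, y_prev):
        int_acc = sum(w * q for w, q in zip(row, int_part))   # dense INT MAC
        fp_acc = sum(row[i] * d for i, d in fp_part.items())  # sparse FP MAC
        y.append(y0 + scale * int_acc + fp_acc)
    return y


# Toy data: inputs of one layer in two adjacent iterations (hypothetical values).
x_prev = [0.10, -0.20, 0.30, 0.05]
x_cur = [0.12, -0.21, 1.50, 0.05]
weights = [[1, -2, 1, 0], [0, 1, -1, 2]]

y_prev = [sum(w * x for w, x in zip(row, x_prev)) for row in weights]
y_exact = [sum(w * x for w, x in zip(row, x_cur)) for row in weights]

int_part, fp_part, scale = split_delta(x_prev, x_cur)
y_approx = delta_matvec(weights, y_prev, int_part, fp_part, scale)

print("INT part:", int_part)
print("FP part :", fp_part)
print("max abs error:", max(abs(a - b) for a, b in zip(y_approx, y_exact)))
```

In this toy case only one of the four differences falls outside the assumed range and is kept in FP. The other three go through the integer path, and the only error introduced is the rounding on the INT grid.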

A 28.6 mJ/iter Stable Diffusion Processor for Text-to-Image Generation with Patch Similarity-based Sparsity Augmentation and Text-based Mixed-Precision

Emage: Non-Autoregressive Text-to-Image Generation

EdgeFusion: On-Device Text-to-Image Generation

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

A 3.89-Gops/mw Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

A 52.01 TFLOPS/W Diffusion Model Processor with Inter-Time-Step Convolution-Attention-Redundancy Elimination and Bipolar Floating-Point Multiplication

SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

20.2 A 28nm 74.34TFLOPS/W BF16 Heterogenous CIM-Based Accelerator Exploiting Denoising-Similarity for Diffusion Models

MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

Trainer: an Energy-Efficient Edge-Device Training Processor Supporting Dynamic Weight Pruning

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

STICKER: an Energy-Efficient Multi-Sparsity Compatible Accelerator for Convolutional Neural Networks in 65-Nm CMOS

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference

Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion

STICKER-T: an Energy-Efficient Neural Network Processor Using Block-Circulant Algorithm and Unified Frequency-Domain Acceleration

A 11.6μW Computing-on-Memory-Boundary Keyword Spotting Processor with Joint MFCC-CNN Ternary Quantization

An Energy-Efficient Convolutional Neural Network Processor Architecture Based on a Systolic Array