20.2 A 28nm 74.34TFLOPS/W BF16 Heterogenous CIM-Based Accelerator Exploiting Denoising-Similarity for Diffusion Models

Ruiqi Guo, Lei Wang, Xiaofeng Chen, Hao Sun, Zhiheng Yue, Yubin Qin, Huiming Han, Yang Wang, Fengbin Tu, Shaojun Wei, Yang Hu, Shouyi Yin
DOI: https://doi.org/10.1109/isscc49657.2024.10454308
2024-01-01
Abstract: Diffusion models (DMs) have emerged as a powerful category of generative models with record-breaking performance in image synthesis [1]. A noisy image created from pure Gaussian random variables needs to be denoised by iterative DMs to ensure generative quality. For DMs, quantizing activations to integers (INT) degrades image quality due to changes in activation distributions and the accumulation of quantization errors across iterations. A GPU (Nvidia A100) requires 2560 ms and 250 W to generate a $256 \times 256$ image through 50 iterations of a floating-point (FP) DM. Two adjacent denoised images are visually similar, and the difference between pixels at the same position is very small. As a result, for two adjacent DM iterations, most input differences within the same layer ($\Delta IN$) are consistently clustered within a narrow range, indicating that most $\Delta IN$ can be quantized as INT data. The remaining $\Delta IN$ have relatively large values, whose distributions vary across DM iterations. To ensure generative quality, a complete $\Delta IN$ tensor is divided into a dense INT tensor (INT-$\Delta IN$) and a sparse FP tensor (FP-$\Delta IN$). Compute-in-memory (CIM) has shown high throughput and energy efficiency on INT multiply-and-accumulate (MAC) operations, demonstrating its potential to process $\Delta IN$ efficiently. However, prior CIM chips face three challenges in speeding up on-device image generation to seconds with low power consumption [2–7]. First, conventional CIM chips perform MACs with bit-serial inputs, leading to significant runtime. A recent CIM chip incorporates an additional adder tree to handle one more input bit, albeit at the cost of 85.4% more power and 82.5% more area [3]. Second, CIM chips cannot process FP data as fast as INT data. They either require repeated reads/writes to handle high-precision mantissas [4] or face lengthy alignment-cycle latencies [5]. Third, previous FP CIMs do not support identifying and utilizing stored sparse data, leading to redundant computations.
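
The division of a $\Delta IN$ tensor into a dense INT part and a sparse FP part can be illustrated with a minimal sketch. The abstract does not specify how the chip chooses the partition, so everything concrete here is an assumption made for illustration: the function names `split_delta` and `reconstruct`, the signed 8-bit range, and the fixed quantization step `scale`. Differences that fit the INT range are stored as integers, and the few large outliers are kept in floating point, indexed by position.

```python
from typing import Dict, List, Tuple


def split_delta(
    prev: List[float],
    curr: List[float],
    scale: float = 0.01,   # assumed quantization step, not from the paper
    int_bits: int = 8,     # assumed signed INT width, not from the paper
) -> Tuple[List[int], Dict[int, float]]:
    """Split (curr - prev) into a dense INT list and a sparse FP dict."""
    qmax = 2 ** (int_bits - 1) - 1  # largest magnitude kept as INT
    dense_int: List[int] = []
    sparse_fp: Dict[int, float] = {}
    for i, (p, c) in enumerate(zip(prev, curr)):
        delta = c - p
        q = round(delta / scale)
        if abs(q) <= qmax:
            # Small difference: keep the quantized integer.
            dense_int.append(q)
        else:
            # Large outlier: keep full precision in the sparse FP part.
            dense_int.append(0)
            sparse_fp[i] = delta
    return dense_int, sparse_fp


def reconstruct(
    prev: List[float],
    dense_int: List[int],
    sparse_fp: Dict[int, float],
    scale: float = 0.01,
) -> List[float]:
    """Rebuild the current input from the previous input and both parts."""
    return [
        p + sparse_fp.get(i, q * scale)
        for i, (p, q) in enumerate(zip(prev, dense_int))
    ]


if __name__ == "__main__":
    prev_in = [0.50, -0.20, 1.00, 0.30]
    curr_in = [0.52, -0.21, 3.00, 0.30]
    ints, fps = split_delta(prev_in, curr_in)
    print(ints, fps)
    print(reconstruct(prev_in, ints, fps))
```

In this toy input only the third element changes by more than the INT range can represent, so it alone lands in the sparse FP part while the other three differences stay in the dense INT part.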