Abstract: Diffusion models (DMs) have emerged as a powerful category of generative models with record-breaking performance in image synthesis [1]. A noisy image sampled from pure Gaussian random variables must be denoised iteratively by the DM to ensure generative quality. For DMs, quantizing activations to integers (INT) degrades image quality because activation distributions change and quantization errors accumulate across iterations. A GPU (Nvidia A100) requires 2560 ms and 250 W to generate a $256 \times 256$ image through 50 iterations of a floating-point (FP) DM. Two adjacent denoised images are visually similar: the difference between pixels at the same position is very small. As a result, for two adjacent DM iterations, most input differences within the same layer ($\Delta IN$) are consistently clustered within a narrow range, indicating that most $\Delta IN$ can be quantized as INT data. The remaining $\Delta IN$ have relatively large values, whose distributions vary across DM iterations. To ensure generative quality, a complete $\Delta IN$ tensor is divided into a dense INT tensor (INT-$\Delta IN$) and a sparse FP tensor (FP-$\Delta IN$). Compute-in-memory (CIM) has shown high throughput and energy efficiency on INT multiply-and-accumulate (MAC), demonstrating its potential to process $\Delta IN$ efficiently. However, prior CIM chips face three challenges in speeding up on-device image generation to seconds with low power consumption [2–7]. First, conventional CIM chips perform MACs with bit-serial inputs, leading to significant runtime. A recent CIM chip incorporates an additional adder tree to handle one more input bit, albeit at the cost of 85.4% more power and 82.5% more area [3]. Second, CIM chips cannot process FP data at the same speed as INT data. They either require repeated reads/writes to handle high-precision mantissas [4] or face lengthy alignment-cycle latencies [5]. Third, previous FP CIMs do not support identifying and utilizing stored sparse data, leading to redundant computations.
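
The split of $\Delta IN$ into a dense INT tensor and a sparse FP tensor can be illustrated with a minimal Python sketch. It is not the chip's actual partitioning rule: the function names `split_delta` and `merge_delta`, the fixed `threshold`, the 8-bit width, and the flat-list tensor layout are all assumptions made for illustration. Differences inside the narrow range are quantized to signed integers, while the few large differences are kept in floating point, indexed by position.

```python
def split_delta(prev, curr, threshold=0.5, bits=8):
    """Split (curr - prev) into a dense INT part and a sparse FP part.

    Hypothetical illustration: threshold and bits are assumed values,
    not parameters taken from the paper.
    """
    qmax = 2 ** (bits - 1) - 1        # largest signed INT level, 127 for 8 bits
    scale = threshold / qmax          # real-valued size of one INT step
    int_part = []                     # dense: one integer per element
    fp_part = {}                      # sparse: position -> FP outlier
    for i, (p, c) in enumerate(zip(prev, curr)):
        d = c - p
        if abs(d) <= threshold:
            int_part.append(round(d / scale))   # small delta: quantize to INT
        else:
            int_part.append(0)                  # large delta: keep it in FP
            fp_part[i] = d
    return int_part, fp_part, scale


def merge_delta(int_part, fp_part, scale):
    """Rebuild the full delta from the dense INT and sparse FP parts."""
    return [fp_part.get(i, q * scale) for i, q in enumerate(int_part)]


if __name__ == "__main__":
    prev = [0.10, 0.20, 0.30, 0.40]   # layer input at one iteration (made up)
    curr = [0.11, 0.19, 1.80, 0.40]   # layer input at the next iteration
    ints, outliers, step = split_delta(prev, curr)
    print("INT part:", ints)
    print("FP part:", outliers)
    print("rebuilt delta:", merge_delta(ints, outliers, step))
```

In this sketch only the element with a large change ends up in the sparse FP part. The remaining elements are covered by the INT part, with a rounding error of at most half an INT step each.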

A 28nm 128TFLOPS/W Computing-In-Memory Engine Supporting One-Shot Floating-Point NN Inference and On-Device Fine-Tuning for Edge AI

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A 28nm 16.9-300TOPS/W Computing-in-Memory Processor Supporting Floating-Point NN Inference/Training with Intensive-CIM Sparse-Digital Architecture

A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration

A 28-nm Floating-Point Computing-in-Memory Processor Using Intensive-CIM Sparse-Digital Architecture

A 28nm 314.6TFLOPS/W Reconfigurable Floating-Point Analog Compute-In-Memory Macro with Exponent Approximation and Two-Stage Sharing TD-ADC

In-Memory Multi-Bit Multiplication and Accumulation (MAC) Using FeFET for Energy Efficient IoT

A 1.97 TFLOPS/W Configurable SRAM-Based Floating-Point Computation-in-Memory Macro for Energy-Efficient AI Chips

A 28-nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for Floating-Point CNNs

A Reconfigurable Floating-Point Compute-In-Memory with Analog Exponent Pre-Processes

GCFP-ACIM: A 40nm 4.74TFLOPS/W General Complex Float-Point Analog Compute-in-Memory with Adaptive Power-Saving for HDR Signal Processing Applications

ReDCIM: Reconfigurable Digital Computing-in-Memory Processor with Unified FP/INT Pipeline for Cloud AI Acceleration

A 19.7 TFLOPS/W Multiply-less Logarithmic Floating-Point CIM Architecture with Error-Reduced Compensated Approximate Adder

An 8.8 TFLOPS/W Floating-Point RRAM-Based Compute-in-Memory Macro Using Low Latency Triangle-Style Mantissa Multiplication

AFPR-CIM: An Analog-Domain Floating-Point RRAM-based Compute-In-Memory Architecture with Dynamic Range Adaptive FP-ADC

Simulation of a Fully Digital Computing-in-Memory for Non-Volatile Memory for Artificial Intelligence Edge Applications

A Heterogeneous Microprocessor Based on All-Digital Compute-in-Memory for End-to-End AIoT Inference

20.2 A 28nm 74.34TFLOPS/W BF16 Heterogenous CIM-Based Accelerator Exploiting Denoising-Similarity for Diffusion Models

TT@CIM: A Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity Optimization and Variable Precision Quantization