Abstract: Computing-in-memory (CIM) is an attractive approach for energy-efficient neural network (NN) processors. Attention mechanisms show great performance in NLP and CV by capturing contextual knowledge from the entire set of tokens (X). The attention mechanism is essentially a content-based similarity search that computes attention probabilities (P) and final attention results (Att). For P, the query (Q) and the key (K) are first computed from X and the weight matrices $(\text{W}_{Q}, \text{W}_{K})$, respectively. Then, Q is multiplied by $\text{K}^{T}$ $(\text{Q}\times \text{K}^{T})$ to obtain the attention score (S). Finally, P is computed by applying Softmax to S. For Att, the value (V) is obtained by multiplying X by a weight matrix $(\text{W}_{V})$, and Att is then computed by multiplying P by V $(\text{P}\times \text{V})$. As shown in Fig. 1, previous CIM chips face several challenges in P and Att computing [1, 2]. First, CIM shows great advantages only when multiplying by a fixed matrix. But in P and Att computing, only $\text{W}_{Q}$, $\text{W}_{K}$, and $\text{W}_{V}$ are fixed, and they are involved in just 15% of the computations in Longformer. Thus, most computations mismatch the traditional CIM paradigm. Second, in $\text{Q}\times \text{K}^{T}$, 34.7% of the computations are redundant, because many near-zero Softmax outputs become zero after quantization. Third, CIM macros naturally perform inner products. For Att, V is generated row by row (i.e., token-wise), but in $\text{P}\times \text{V}$, each column of V is left-multiplied by P (i.e., across tokens). Only after V has been fully generated can CIM macros perform $\text{P}\times \text{V}$. Thus, Att computing cannot be fully pipelined, which reduces system throughput. This paper presents a processor named AttCIM that addresses these issues with three key features: 1) a correlative CIM ring (CRCIMR) that avoids loading dynamically generated matrices; 2) a Softmax-based speculate unit (SSU) that eliminates redundant computations in $\text{Q}\times \text{K}^{T}$; and 3) a dataflow-reshaped digital-assisted CIM-array (DRCIMA) that achieves fully pipelined computation of $\text{P}\times \text{V}$.
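
The P and Att computation described in the abstract can be written out as a plain software reference. The sketch below is only an illustration of the equations (Q = X W_Q, K = X W_K, S = Q x K^T, P = Softmax(S), V = X W_V, Att = P x V); it does not model AttCIM's hardware, the CRCIMR, the SSU, or the DRCIMA. The helper names (`matmul`, `softmax_rows`, `quantize`, `attention`), the toy inputs, and the 8-bit uniform quantizer are assumptions made for this example and do not come from the paper.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax_rows(s):
    """Row-wise Softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in s:
        row_max = max(row)
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def quantize(p, bits=8):
    """Hypothetical uniform quantizer for probabilities in [0, 1]."""
    levels = (1 << bits) - 1
    return [[round(v * levels) / levels for v in row] for row in p]

def attention(x, w_q, w_k, w_v):
    """Compute S, P and Att for tokens x (one token per row)."""
    q = matmul(x, w_q)            # Q = X * W_Q
    k = matmul(x, w_k)            # K = X * W_K
    v = matmul(x, w_v)            # V = X * W_V, produced row by row (token-wise)
    s = matmul(q, transpose(k))   # S = Q x K^T
    p = softmax_rows(s)           # P = Softmax(S)
    att = matmul(p, v)            # Att = P x V, each output needs a full column of V
    return s, p, att

# Toy input: three tokens of dimension two, identity weights (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
scores, probs, att = attention(tokens, identity, identity, identity)

# Count Softmax outputs that collapse to exactly zero after quantization.
zeros_after_quant = sum(v == 0.0 for row in quantize(probs) for v in row)
print(zeros_after_quant, "of", len(probs) * len(probs[0]), "entries quantize to zero")
```

In this toy case several entries of P are tiny but nonzero before quantization and exactly zero afterwards, which is the kind of redundancy in $\text{Q}\times \text{K}^{T}$ that the abstract attributes to quantized Softmax outputs. The comment on the last `matmul` also shows why $\text{P}\times \text{V}$ has to wait for all rows of V in a straightforward inner-product formulation.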

16.4 TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration

TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration.

TensorCIM: Digital Computing-in-Memory Tensor Processor with Multichip-Module-Based Architecture for Beyond-NN Acceleration

A 28nm 16.9-300TOPS/W Computing-in-Memory Processor Supporting Floating-Point NN Inference/Training with Intensive-CIM Sparse-Digital Architecture

A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating

TT@CIM: A Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity Optimization and Variable Precision Quantization

A Digital SRAM Computing-in-Memory Design Utilizing Activation Unstructured Sparsity for High-Efficient DNN Inference

A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

DCIM-GCN: Digital Computing-in-Memory Accelerator for Graph Convolutional Network

A 28nm 57.6TOPS/W Attention-based NN Processor with Correlative Computing-in-Memory Ring and Dataflow-reshaped Digital-assisted Computing-in-Memory Array

An Event-Based Digital Compute-In-Memory Accelerator with Flexible Operand Resolution and Layer-Wise Weight/Output Stationarity

A 28-nm 36 Kb SRAM CIM Engine with 0.173 $\mu\text{m}^{2}$ 4T1T Cell and Self-Load-0 Weight Update for AI Inference and Training Applications

A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes

15.4 A 5.99-to-691.1TOPS/W Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity-Based Optimization and Variable-Precision Quantization

Cambricon-M: A Fibonacci-Coded Charge-Domain SRAM-Based CIM Accelerator for DNN Inference

A 28-nm 28.8-TOPS/W Attention-Based NN Processor with Correlative CIM Ring Architecture and Dataflow-Reshaped Digital-Assisted CIM Array

GCFP-ACIM: A 40nm 4.74TFLOPS/W General Complex Float-Point Analog Compute-in-Memory with Adaptive Power-Saving for HDR Signal Processing Applications

14.3 A 65nm Computing-in-Memory-Based CNN Processor with 2.9-to-35.8 TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling Architecture and Energy …

An Emerging NVM CIM Accelerator with Shared-Path Transpose Read and Bit-Interleaving Weight Storage for Efficient On-Chip Training in Edge Devices