Abstract: Transformer models show state-of-the-art results in natural language processing and computer vision, leveraging a multi-headed self-attention mechanism. In each head, the operation is defined as $\text{Attn}=\text{Softmax}(\mathrm{Q}\cdot \mathrm{K}^{\top})\cdot \mathrm{V}$, where $\mathrm{Q}=\mathrm{X}\cdot \mathrm{W}_{\mathrm{Q}}$, $\mathrm{K}=\mathrm{X}\cdot \mathrm{W}_{\mathrm{K}}$ and $\mathrm{V}=\mathrm{X}\cdot \mathrm{W}_{\mathrm{V}}$ are linear transformations for Query (Q), Key (K) and Value (V) with weights $\mathrm{W}_{\mathrm{Q}}$, $\mathrm{W}_{\mathrm{K}}$ and $\mathrm{W}_{\mathrm{V}}$, respectively. $\mathrm{Q}\cdot \mathrm{K}^{\top}$ is responsible for learning relevance scores between tokens (X). Previous computing-in-memory (CIM) chips face new challenges within and across attention heads (Fig. 1): (1) CIM shows great advantages only when the pre-trained weights are fixed. However, since Q and K are both generated at runtime, loading K into CIM macros consumes more energy when computing $\mathrm{Q}\cdot \mathrm{K}^{\top}$. (2) A CIM macro performs multiply-accumulate (MAC) operations with bit-serial inputs, so its latency is determined by the input precision. The attention scores are normalized by Softmax into probabilities (P), and most elements of P are close to or exactly zero. For the $\mathrm{P}\cdot \mathrm{V}$ of Bert-T, only 10% of the elements have a large effective bit-width (EBW), yet they lengthen the computing latency of the remaining 90% of inputs. (3) On-chip SRAM-CIM cannot hold the weights of all heads. After completing the computation on the stored weights, all macros need to reload new weights, causing a significant performance loss. This work designs a CIM processor for Transformers, called CIMFormer, with three features to address these challenges: (1) A token-slimmed $\mathrm{Q}\cdot \mathrm{K}^{\top}$ reformulation is proposed to reduce the loading of intermediate data and redundant computations in CIM macros. In addition, a column-partitioned $\mathrm{X}\vert \mathrm{W}$-CIM with a flexible set-aggregate adder tree (Flex-AT) is designed to efficiently match the reformulated $\mathrm{Q}\cdot \mathrm{K}^{\top}$ with high utilization. (2) A principal possibility gather-scatter scheduler (PPGSS) collects the elements with large EBWs, called principal possibility elements (PPEs), as simultaneous activations for a CIM macro, reducing the compute latency of the remaining activations with small EBWs. (3) A systolic CIM array supports bidirectional matrix multiplications with macro-interleaved broadcast for activations. In the systolic CIM macro array, array-level weight reloading is divided into macro-level reloading and hidden in macro-level systolic computing.
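The latency argument behind the gather-scatter scheduling can be illustrated with a minimal Python sketch. It is not the PPGSS hardware: the cycle model (a group of simultaneous activations takes as many bit-serial cycles as its widest element), the group size of 4, the function names, and the example data are all assumptions made for illustration. Reordering the elements of P is only valid if the matching rows of V are reordered in the same way, which the sketch does not model.

```python
def effective_bit_width(value: int) -> int:
    """Bits needed to represent a non-negative integer (0 needs 0 bits)."""
    return value.bit_length()


def bit_serial_cycles(activations, group_size):
    """Assumed cycle model: activations are fed group by group, and each
    group takes as many bit-serial cycles as its widest element."""
    cycles = 0
    for start in range(0, len(activations), group_size):
        group = activations[start:start + group_size]
        cycles += max(effective_bit_width(a) for a in group)
    return cycles


def gathered_cycles(activations, group_size):
    """Gather large-EBW elements into the same groups before feeding them,
    so that small-EBW elements no longer wait for wide neighbours."""
    ordered = sorted(activations, key=effective_bit_width, reverse=True)
    return bit_serial_cycles(ordered, group_size)


# Hypothetical 8-bit quantized softmax row: four large elements, one per
# group, and the rest close to or exactly zero.
example_row = [200, 0, 1, 0, 0, 150, 0, 2, 1, 0, 180, 0, 0, 0, 1, 220]

print(bit_serial_cycles(example_row, 4))  # 32: every group waits for one wide element
print(gathered_cycles(example_row, 4))    # 10: the wide elements share one group
```

In this toy row, each group of four contains one 8-bit element, so the ungathered schedule pays 8 cycles per group. Gathering the four wide elements into one group lets the remaining groups finish in at most 2 cycles under the assumed model.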

CIMFormer: A 38.9TOPS/W-8b Systolic CIM-Array Based Transformer Processor with Token-Slimmed Attention Reformulating and Principal Possibility Gathering

CIMFormer: A Systolic CIM-Array-Based Transformer Accelerator with Token-Pruning-Aware Attention Reformulating and Principal Possibility Gathering

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A 28-nm 28.8-TOPS/W Attention-Based NN Processor with Correlative CIM Ring Architecture and Dataflow-Reshaped Digital-Assisted CIM Array

TranCIM: Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes

A 28nm 57.6TOPS/W Attention-based NN Processor with Correlative Computing-in-Memory Ring and Dataflow-reshaped Digital-assisted Computing-in-Memory Array

A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes

34.3 A 22nm 64kb Lightning-Like Hybrid Computing-in-Memory Macro with a Compressed Adder Tree and Analog-Storage Quantizers for Transformer and CNNs.

MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity

16.1 MulTCIM: A 28nm $2.24\,\mu\mathrm{J}$/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers

TensorCIM: Digital Computing-in-Memory Tensor Processor with Multichip-Module-Based Architecture for Beyond-NN Acceleration

S2D-CIM: A 22nm 128kb Systolic Digital Compute-in-Memory Macro with Domino Data Path for Flexible Vector Operation and 2-D Weight Update in Edge AI Applications

TT@CIM: A Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity Optimization and Variable Precision Quantization

SPCIM: Sparsity-Balanced Practical CIM Accelerator with Optimized Spatial-Temporal Multi-Macro Utilization

SSM-CIM: an Efficient CIM Macro Featuring Single-Step Multi-bit MAC Computation for CNN Edge Inference

An eDRAM Based Computing-in-Memory Macro with Full-Valid-Storage and Channel-Wise-Parallelism for Depthwise Neural Network

A 28nm 32kb SRAM Computing-in-Memory Macro with Hierarchical Capacity Attenuator and Input Sparsity-Optimized ADC for 4b MAC Operation

14.3 A 65nm Computing-in-Memory-Based CNN Processor with 2.9-to-35.8TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling Architecture and Energy-Efficient Inter/Intra-Macro Data Reuse.

A Twin-8T SRAM Computation-in-Memory Unit-Macro for Multibit CNN-Based AI Edge Processors