Abstract: Transformer models have achieved impressive performance in various artificial intelligence (AI) applications. However, their high computation cost and memory footprint make inference inefficient. Although digital compute-in-memory (CIM) is a promising hardware architecture with high accuracy, the Transformer’s attention mechanism raises three challenges for CIM access and computation: 1) the attention computation involving Query and Key results in massive data movement and under-utilization of CIM macros; 2) the attention computation involving Possibility and Value exhibits abundant dynamic bit-level sparsity, resulting in redundant bit-serial CIM operations; and 3) the restricted data-reload bandwidth of CIM macros significantly degrades performance for large Transformer models. To address these challenges, we design a CIM accelerator called CIM Transformer (CIMFormer) with three corresponding features. First, token-pruning-aware attention reformulation (TPAR) adjusts attention computations according to the token-pruning ratio, reducing real-time access to, and under-utilization of, CIM macros. Second, the principal possibility gather-scatter scheduler (PPGSS) gathers the possibilities with greater effective bit-width as concurrent inputs to CIM macros, enhancing the efficiency of bit-serial CIM operations. Third, the systolic X|W-CIM macro array efficiently executes large Transformer models that exceed the storage capacity of the on-chip CIM macros. Fabricated in a 28-nm technology, CIMFormer achieves a peak energy efficiency of 15.71 TOPS/W, an improvement of over 1.46$\times$ compared with the state-of-the-art Transformer accelerator under equivalent conditions.
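To make the second challenge concrete, the following is a minimal sketch of why grouping inputs by effective bit-width can reduce bit-serial work. It is a toy cycle-count model, not the PPGSS design described in the paper. It assumes that inputs fed concurrently to a macro cost as many bit-serial cycles as the widest input in the group, that the possibilities are unsigned 8-bit integers, and that four inputs are fed at a time. The values, the group size, and all function names are hypothetical.

```python
def effective_bits(x: int) -> int:
    """Bits needed to represent the unsigned integer x (0 needs 0 bits)."""
    return x.bit_length()


def chunk(values, size):
    """Split values into consecutive groups of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def bit_serial_cycles(groups):
    """Assumed cost model: each group costs as many cycles as its widest input."""
    return sum(max(effective_bits(v) for v in g) for g in groups if g)


# Hypothetical quantized possibilities: a few large values, many near zero.
probs = [200, 3, 1, 0, 150, 2, 0, 1]
group_size = 4  # hypothetical number of concurrent inputs

# Baseline: feed inputs in their original order.
naive_cycles = bit_serial_cycles(chunk(probs, group_size))

# Gather: reorder so that inputs with similar effective bit-width share a group.
# `order` records the permutation so results could be scattered back afterwards.
order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
gathered = [probs[i] for i in order]
gathered_cycles = bit_serial_cycles(chunk(gathered, group_size))

print(naive_cycles, gathered_cycles)
```

In this toy model each original group contains one wide value and therefore pays the full 8 cycles, whereas after gathering only the group holding the two wide values does. A real scheduler would also have to keep each input paired with its corresponding Value data, which this sketch only hints at through the recorded `order`.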

A 28nm 4.35TOPS/mm² Transformer Accelerator with Basis-vector Based Ultra Storage Compression, Decomposed Computation and Unified LUT-Assisted Cores

A 28nm 77.35TOPS/W Similar Vectors Traceable Transformer Processor with Principal-Component-Prior Speculating and Dynamic Bit-wise Stationary Computing

DaDianNao: A Machine-Learning Supercomputer

A 28-nm 28.8-TOPS/W Attention-Based NN Processor with Correlative CIM Ring Architecture and Dataflow-Reshaped Digital-Assisted CIM Array

H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices

A 22nm 54.94TFLOPS/W Transformer Fine-Tuning Processor with Exponent-Stationary Re-Computing, Aggressive Linear Fitting, and Logarithmic Domain Multiplicating

7.2 A 12nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS

A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes

A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing

CIMFormer: A Systolic CIM-Array-Based Transformer Accelerator with Token-Pruning-Aware Attention Reformulating and Principal Possibility Gathering

A 28nm 49.7TOPS/W Sparse Transformer Processor with Random-Projection-Based Speculation, Multi-Stationary Dataflow, and Redundant Partial Product Elimination

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Monolithic 3D Integration of Analog RRAM-Based Fully Weight Stationary and Novel CFET 2T0C-Based Partially Weight Stationary for Accelerating Transformer

A Low-Power Sparse Convolutional Neural Network Accelerator with Pre-Encoding Radix-4 Booth Multiplier

7.5 A 65nm 0.39-to-140.3TOPS/W 1-to-12b Unified Neural Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1× Higher TOPS/mm² and 6T HBST-TRAM-Based 2D Data-Reuse Architecture

A 28 nm 81 Kb 5995.3 TOPS/W 4T2R ReRAM Computing-in-Memory Accelerator With Voltage-to-Time-to-Digital Based Output

TranCIM: Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes
