Abstract: Computing-in-memory (CIM) relieves the von Neumann bottleneck by storing the weights of neural networks in memory arrays. However, two challenges still hinder the efficient acceleration of convolutional neural networks (CNNs) in artificial intelligence (AI) edge devices. First, the activations for sliding-window (SW) operations in CNNs still impose high memory-access pressure. Increasing the SW parallelism alleviates this, but simple array replication suffers from poor array utilization and large peripheral-circuit overhead. Second, the partial sums from individual CIM arrays, which are usually accumulated to obtain the final sum, introduce large latency because of the many shift-and-add operations required. Moreover, high-resolution ADCs are needed to reduce the quantization error of the partial sums, further increasing hardware cost. In this paper, a hardware-efficient CIM accelerator, ARBiS, is proposed with improved activation reusability and bit-scalable matrix-vector multiplication (MVM) for CNN acceleration in AI edge applications. Cyclic-shift weight duplication exploits a third dimension of receptive field (RF) depth for SW weight mapping to reduce the memory accesses of activations, improving the array utilization. Parasitic-capacitance charge sharing is employed to realize high-precision analog MVM and thereby reduce the ADC cost. Compared with conventional architectures, ARBiS with parallel processing of 9 SW operations alleviates memory-access pressure by 56.6%~58.8%. Meanwhile, ARBiS configured with 8-bit ADCs saves 92.53%~94.53% of ADC energy consumption. An ARBiS accelerator is evaluated to achieve a computational efficiency (CE) of 10.28 (10.43) TOPS/mm$^{2}$ and an energy efficiency (EE) of 91.19 (112.36) TOPS/W with 8-bit (4-bit) ADCs, corresponding to $11.4\sim 11.7\times$ ($11.6\sim 11.8\times$) and $1.1\sim 3.3\times$ ($1.4\sim 4\times$) improvements over state-of-the-art works, respectively.
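
To illustrate why partial-sum accumulation can become a latency burden, the following minimal sketch models a conventional bit-sliced MVM in plain Python. It is not the ARBiS scheme: the function name `bit_serial_mvm`, the 4-bit unsigned operand widths, and the one-bit-per-cycle slicing are assumptions chosen for illustration. Each 1-bit by 1-bit dot product stands in for one array read, and every such partial sum needs its own shift-and-add step.

```python
def bit_serial_mvm(weights, activations, w_bits=4, a_bits=4):
    """Multiply a weight matrix by an activation vector one bit-plane at a time.

    Hypothetical model: operands are unsigned integers that fit in
    w_bits / a_bits. Returns the output vector and the number of
    shift-and-add steps that were needed.
    """
    outputs = [0] * len(weights)
    shift_add_ops = 0
    for r, row in enumerate(weights):
        for wb in range(w_bits):          # weight bit-plane (one array column)
            for ab in range(a_bits):      # activation bit applied in this cycle
                # 1-bit x 1-bit dot product: the partial sum one column yields
                partial = sum(((w >> wb) & 1) * ((a >> ab) & 1)
                              for w, a in zip(row, activations))
                # shift by the combined bit significance, then accumulate
                outputs[r] += partial << (wb + ab)
                shift_add_ops += 1
    return outputs, shift_add_ops


weights = [[3, 5, 7], [1, 2, 15]]
activations = [2, 9, 4]
result, ops = bit_serial_mvm(weights, activations)
print(result, ops)  # [79, 80] 32
```

In this model the number of shift-and-add steps grows as rows × weight bits × activation bits, which is the scaling that motivates accumulating more of the sum before digitization.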

EF-CIM: an Endurance Friendly CIM Accelerator Using Embedded NVM with Bit-Aware Wear Leveling for Efficient Light-Weight On-Chip Training in Edge Devices

An 8-Bit in Resistive Memory Computing Core with Regulated Passive Neuron and Bitline Weight Mapping

An Emerging NVM CIM Accelerator with Shared-Path Transpose Read and Bit-Interleaving Weight Storage for Efficient On-Chip Training in Edge Devices

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating

Benchmark of the Compute-in-Memory-Based DNN Accelerator With Area Constraint

ARBiS: A Hardware-Efficient SRAM CIM CNN Accelerator with Cyclic-Shift Weight Duplication and Parasitic-Capacitance Charge Sharing for AI Edge Application

Light-CIM: A Lightweight ADC/DAC-Fewer RRAM CIM DNN Accelerator with Fully-Analog Tiles and Non-Ideality-Aware Algorithm for Consumer Electronics

A Heterogeneous Microprocessor for Intermittent AI Inference Using Nonvolatile-SRAM-based Compute-In-Memory

TensorCIM: Digital Computing-in-Memory Tensor Processor with Multichip-Module-Based Architecture for Beyond-NN Acceleration

Design of Computing-in-Memory (CIM) with Vertical Split-Gate Flash Memory for Deep Neural Network (DNN) Inference Accelerator

A 28-nm 36 Kb SRAM CIM Engine with 0.173 $\mu$m$^{2}$ 4T1T Cell and Self-Load-0 Weight Update for AI Inference and Training Applications

S2D-CIM: A 22nm 128kb Systolic Digital Compute-in-Memory Macro with Domino Data Path for Flexible Vector Operation and 2-D Weight Update in Edge AI Applications

An Event-Based Digital Compute-In-Memory Accelerator with Flexible Operand Resolution and Layer-Wise Weight/Output Stationarity

16.4 TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration

Cambricon-M: A Fibonacci-Coded Charge-Domain SRAM-Based CIM Accelerator for DNN Inference

An 8-bit In Resistive Memory Computing Core with Regulated Passive Neuron and Bit Line Weight Mapping

TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration

An eDRAM-Based Computing-in-Memory Macro with Full-Valid-Storage and Channel-Wise-Parallelism for Depthwise Neural Network

CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory Based Neural Network Accelerators