Abstract: Quantized neural networks (QNNs), which perform multiply-accumulate (MAC) operations with low-precision weights or activations, have been widely used to reduce energy consumption. QNNs trade off energy consumption against accuracy depending on the quantized precision, so selecting an appropriate precision is necessary for energy efficiency. However, conventional hardware accelerators such as the Google TPU are typically designed and optimized for a specific precision (e.g., 8-bit), which may degrade energy efficiency at other precisions. Although analog-based computing-in-memory (CIM) technology supporting variable precision has been proposed to improve energy efficiency, its implementation requires extremely large and power-consuming analog-to-digital converters (ADCs). In this paper, we propose Scale-CIM, a precision-scalable CIM architecture that supports MAC operations based on digital computations (not analog computations). Scale-CIM performs binary MAC operations with high parallelism by executing digital multiplication operations in the CIM array and accumulation operations in the peripheral logic. In addition, Scale-CIM supports multi-bit MAC operations without ADCs by combining the binary MAC operations with shift operations that depend on the precision. Since Scale-CIM fully utilizes the CIM array for various quantized precisions (not only for a specific precision), it achieves high compute throughput. Consequently, Scale-CIM enables precision-scalable CIM-based MAC operations with high parallelism. Our simulation results show that Scale-CIM achieves a 1.5∼15.8× speedup and reduces system energy consumption by 53.7∼95.7% across different quantized precisions, compared to the state-of-the-art precision-scalable accelerator.
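The idea of building multi-bit MAC operations from binary MAC operations and shifts can be illustrated with a small software sketch. This is not the Scale-CIM hardware design or the authors' code; it is a minimal Python model that assumes unsigned operands and a plain bit-plane decomposition, and the names `binary_mac`, `bit_plane`, and `multibit_mac` are hypothetical. Each binary MAC is a bitwise AND (standing in for the in-array multiply) followed by a sum (standing in for the peripheral accumulation), and each partial result is shifted by the combined bit significance.

```python
def binary_mac(x_bits, w_bits):
    """1-bit MAC: AND each input/weight bit pair, then sum the products."""
    return sum(x & w for x, w in zip(x_bits, w_bits))


def bit_plane(values, k):
    """Extract bit k (0 = least significant) from every value in the vector."""
    return [(v >> k) & 1 for v in values]


def multibit_mac(xs, ws, x_precision, w_precision):
    """Unsigned multi-bit dot product from binary MACs and shifts.

    Assumption: operands are unsigned and fit in the stated precisions.
    Signed formats would need extra handling not modeled here.
    """
    total = 0
    for i in range(x_precision):
        for j in range(w_precision):
            partial = binary_mac(bit_plane(xs, i), bit_plane(ws, j))
            # Bit i of the input times bit j of the weight has weight 2^(i+j).
            total += partial << (i + j)
    return total


# Usage: 2-bit inputs and 2-bit weights.
xs = [3, 1, 2, 0]
ws = [2, 3, 1, 3]
result = multibit_mac(xs, ws, 2, 2)
reference = sum(x * w for x, w in zip(xs, ws))
print(result, reference)
```

In this model the number of binary MAC passes grows as the product of the two precisions, so lowering either precision reduces the work directly. That is only a property of this sketch, not a claim about how Scale-CIM schedules its array.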

ZEBRA: A Zero-Bit Robust-Accumulation Compute-In-Memory Approach for Neural Network Acceleration Utilizing Different Bitwise Patterns

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

An 8-Bit in Resistive Memory Computing Core with Regulated Passive Neuron and Bitline Weight Mapping

CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory Based Neural Network Accelerators

BASER: Bit-wise Approximate Compressor Configurable In-SRAM-computing for Energy-Efficient Neural Network Acceleration with Data-aware Weight Remapping Method

131TOPS/W 8b ACIM Exploiting Weight-Embedded Auto-Accumulation and Supporting Symmetric Quantization Networks

BR-CIM: an Efficient Binary Representation Computation-In-Memory Design

A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating

Toggle Rate Aware Quantization Model Based on Digital Floating-Point Computing-in-Memory Architecture

Scale-CIM: Precision-Scalable Computing-in-Memory for Energy-Efficient Quantized Neural Networks

A Low-Power Charge-Domain Bit-Scalable Readout System for Fully-Parallel Computing-in-Memory Accelerators

Improving the accuracy of neural networks in analog computing-in-memory systems by analog weight

Cambricon-M: A Fibonacci-Coded Charge-Domain SRAM-Based CIM Accelerator for DNN Inference

Memristor Based Mixed-Accuracy Computation-in-Memory System

A Multiply-Less Approximate SRAM Compute-In-Memory Macro for Neural-Network Inference

CIMulator: A Comprehensive Simulation Platform for Computing-In-Memory Circuit Macros with Low Bit-Width and Real Memory Materials

Weight and Multiply-Accumulation Sparsity-Aware Non-Volatile Computing-in-Memory System

34.3 A 22nm 64kb Lightning-Like Hybrid Computing-in-Memory Macro with a Compressed Adder Tree and Analog-Storage Quantizers for Transformer and CNNs

EBSP: evolving bit sparsity patterns for hardware-friendly inference of quantized deep neural networks

EF-CIM: an Endurance Friendly CIM Accelerator Using Embedded NVM with Bit-Aware Wear Leveling for Efficient Light-Weight On-Chip Training in Edge Devices