Abstract:Weight quantization is well adapted to cope with the ever-growing complexity of the deep neural network (DNN) model. Diversified quantization schemes lead to diverse quantized bit width and formats of the weights, thereby, subject to different hardware implementations. Such variety prevents a general NPU to leverage different quantization schemes to gain performance and energy efficiency. More importantly, a trend of quantization diversity emerges that applies multiple quantization schemes to different fine-grained structures (e.g., a layer or a channel of weight) of a DNN. Therefore, a general architecture is desired to exploit varied quantization schemes. The crossbar-based processing-in-memory (PIM) architecture, a promising DNN accelerator, is well known for its highly efficient matrix-vector multiplication. However, PIM suffers from the inflexible intracrossbar data path because the weight is stationary on the crossbar and binds to the “add” operation along the bitline. Therefore, many nonuniform quantization methods must rollback the quantization before mapping the weights onto the crossbar. Counterintuitively, this article discovers a unique opportunity of the PIM architecture to exploit varied quantization schemes. We first transform the quantization diversity problem into a consistency problem by aligning the bit with the same magnitude along the same bitline of the crossbar. Consequently, such naive weight mapping causes many square hollows of idle PIM cells. We then propose a novel spatial mapping to exempt these “hollow” crossbar from the intercrossbar data path. To further squeeze the weights on fewer crossbars, we decouple the intracrossbar data path from the hardware bitline by a novel temporal scheduling, so that bits with different magnitudes can be placed on cells along the same bitline. Finally, the proposed IVQ includes a temporal pipeline to avoid the introduced stalling cycles, and a data flow with delicate control mechanisms for the new intra and intercrossbar data paths. Putting all together, IVQ achieves $19.7\times $ , $10.7\times $ , $4.7\times \sim 63.4\times $ , $91.7\times $ speedup, and $17.7\times $ , $5.1\times $ , $5.7\times \sim 68.1\times $ , $541\times $ energy savings over two PIM accelerators (ISAAC and CASCADE), two customized quantization accelerators (based on ASIC and FPGA), and NVIDIA RTX 2080 GPU, respectively.

131TOPS/W 8b ACIM Exploiting Weight-Embedded Auto-Accumulation and Supporting Symmetric Quantization Networks

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

Improving the accuracy of neural networks in analog computing-in-memory systems by analog weight.

Improving the Accuracy of Neural Networks in Analog Computing-in-memory Systems by a Generalized Quantization Method

ZEBRA: A Zero-Bit Robust-Accumulation Compute-In-Memory Approach for Neural Network Acceleration Utilizing Different Bitwise Patterns

S2D-CIM: A 22nm 128kb Systolic Digital Compute-in-Memory Macro with Domino Data Path for Flexible Vector Operation and 2-D Weight Update in Edge AI Applications

CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory Based Neural Network Accelerators

CANET: Quantized Neural Network Inference With 8-bit Carry-Aware Accumulator

Searching for Low-Bit Weights in Quantized Neural Networks

34.3 A 22nm 64kb Lightning-Like Hybrid Computing-in-Memory Macro with a Compressed Adder Tree and Analog-Storage Quantizers for Transformer and CNNs.

A2Q+: Improving Accumulator-Aware Weight Quantization

A 16.38TOPS and 4.55POPS/W SRAM Computing-in-Memory Macro for Signed Operands Computation and Batch Normalization Implementation

Integer-Only CNNs with 4 Bit Weights and Bit-Shift Quantization Scales at Full-Precision Accuracy

IVQ: In-Memory Acceleration of DNN Inference Exploiting Varied Quantization

Toggle Rate Aware Quantization Model Based on Digital Floating-Point Computing-in-Memory Architecture

Weight Quantization Method for Spiking Neural Networks and Analysis of Adversarial Robustness

A 2.75-to-75.9tops/w Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating.

Scale-CIM: Precision-Scalable Computing-in-Memory for Energy-Efficient Quantized Neural Networks

A Communication-Aware DNN Accelerator on ImageNet Using In-Memory Entry-Counting Based Algorithm-Circuit-Architecture Co-Design in 65-nm CMOS

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference