Abstract:Weight quantization is well adapted to cope with the ever-growing complexity of the deep neural network (DNN) model. Diversified quantization schemes lead to diverse quantized bit width and formats of the weights, thereby, subject to different hardware implementations. Such variety prevents a general NPU to leverage different quantization schemes to gain performance and energy efficiency. More importantly, a trend of quantization diversity emerges that applies multiple quantization schemes to different fine-grained structures (e.g., a layer or a channel of weight) of a DNN. Therefore, a general architecture is desired to exploit varied quantization schemes. The crossbar-based processing-in-memory (PIM) architecture, a promising DNN accelerator, is well known for its highly efficient matrix-vector multiplication. However, PIM suffers from the inflexible intracrossbar data path because the weight is stationary on the crossbar and binds to the “add” operation along the bitline. Therefore, many nonuniform quantization methods must rollback the quantization before mapping the weights onto the crossbar. Counterintuitively, this article discovers a unique opportunity of the PIM architecture to exploit varied quantization schemes. We first transform the quantization diversity problem into a consistency problem by aligning the bit with the same magnitude along the same bitline of the crossbar. Consequently, such naive weight mapping causes many square hollows of idle PIM cells. We then propose a novel spatial mapping to exempt these “hollow” crossbar from the intercrossbar data path. To further squeeze the weights on fewer crossbars, we decouple the intracrossbar data path from the hardware bitline by a novel temporal scheduling, so that bits with different magnitudes can be placed on cells along the same bitline. Finally, the proposed IVQ includes a temporal pipeline to avoid the introduced stalling cycles, and a data flow with delicate control mechanisms for the new intra and intercrossbar data paths. Putting all together, IVQ achieves $19.7\times $ , $10.7\times $ , $4.7\times \sim 63.4\times $ , $91.7\times $ speedup, and $17.7\times $ , $5.1\times $ , $5.7\times \sim 68.1\times $ , $541\times $ energy savings over two PIM accelerators (ISAAC and CASCADE), two customized quantization accelerators (based on ASIC and FPGA), and NVIDIA RTX 2080 GPU, respectively.

PQ-PIM: A Pruning–quantization Joint Optimization Framework for ReRAM-based Processing-in-memory DNN Accelerator

APQ: Automated DNN Pruning and Quantization for ReRAM-Based Accelerators

Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators

Runtime Row/Column Activation Pruning for ReRAM-based Processing-in-Memory DNN Accelerators

PattPIM: A Practical ReRAM-Based DNN Accelerator by Reusing Weight Pattern Repetitions

A Practical Highly Paralleled ReRAM-Based DNN Accelerator by Reusing Weight Pattern Repetitions

Re2PIM

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

PRAP-PIM: A Weight Pattern Reusing Aware Pruning Method for ReRAM-based PIM DNN Accelerators

ERA-BS: Boosting the Efficiency of ReRAM-based PIM Accelerator with Fine-Grained Bit-Level Sparsity

Learning the sparsity for ReRAM - mapping and pruning sparse neural network for ReRAM based accelerator.

Towards ReRAM-based Accelerator:An Energy-efficient NN Model Compression Framework

A Flexible Yet Efficient DNN Pruning Approach for Crossbar-Based Processing-in-Memory Architectures.

ReRAM-Sharing: Fine-Grained Weight Sharing for ReRAM-Based Deep Neural Network Accelerator.

AUTO-PRUNE

Bit-Transformer: Transforming Bit-level Sparsity into Higher Preformance in ReRAM-based Accelerator

SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration

APIM: An Antiferromagnetic MRAM-Based Processing-In-Memory System for Efficient Bit-level Operations of Quantized Convolutional Neural Networks

SoBS-X: Squeeze-Out Bit Sparsity for ReRAM-Crossbar-Based Neural Network Accelerator.

PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory.

IVQ: In-Memory Acceleration of DNN Inference Exploiting Varied Quantization