Abstract: Resistive Random-Access-Memory (ReRAM) crossbars are among the most promising neural network accelerators, thanks to their in-memory, in-situ analog computing capability for Matrix Multiplication-and-Accumulations (MACs). The key limitations are: 1) the number of rows and columns of ReRAM cells available for concurrent execution of MACs is constrained, which limits in-memory computing throughput; and 2) high-precision analog-to-digital (A/D) conversions are costly and can offset the efficiency and performance benefits of ReRAM-based Processing-In-Memory (PIM). Meanwhile, it is challenging to deploy large Deep Neural Network (DNN) models in the crossbar, since the crossbar structure cannot effectively exploit the sparsity of DNNs, especially activation sparsity. As a countermeasure, we develop a novel ReRAM-based PIM accelerator, ERA-BS, which exploits the correlation between bit-level sparsity (in both weights and activations) and the performance of the ReRAM-based crossbar. We propose a bit-flip scheme combined with exponent-based quantization, which adaptively flips the bits of the mapped DNNs to release redundant space without significantly sacrificing accuracy or incurring much hardware overhead. We also design an architecture that integrates these techniques to greatly shrink the crossbar footprint. We further propose a dynamic activation sparsity exploitation scheme tailored to the tightly coupled structure of the crossbar, including crossbar-aware activation pruning and ancillary run-time hardware support. In this way, we exploit fine-grained sparsity in weights (static) and activations (dynamic), respectively, to improve performance while reducing the energy consumption of computation with negligible overhead.
Our experiments on a wide variety of networks show that, compared with the well-known ReRAM-based PIM accelerator “ISAAC”, ERA-BS achieves up to $43\times$, $78\times$, and $73\times$ improvements in energy efficiency, area efficiency, and throughput, respectively. Compared with the state-of-the-art ReRAM-based design “PIM-Prune”, ERA-BS also achieves $5.3\times$ energy efficiency, $7.2\times$ area efficiency, and a $32\times$ performance gain with similar or even higher accuracy.
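
To make the notion of bit-level sparsity concrete, the sketch below measures the fraction of zero bits in each bit plane of a small set of quantized values. It is only a generic illustration under stated assumptions: the function name `bit_plane_sparsity`, the 8-bit unsigned representation, and the sample values are hypothetical, and the code does not reproduce the ERA-BS bit-flip scheme or its exponent-based quantization.

```python
import numpy as np


def bit_plane_sparsity(values, n_bits=8):
    """Return the fraction of zero bits in each bit plane.

    Index 0 is the least significant bit. Values are assumed to be
    non-negative integers smaller than 2**n_bits (an assumption of this
    sketch, not something stated in the abstract).
    """
    v = np.asarray(values).astype(np.int64).ravel()
    # Extract each bit plane as a 0/1 array, then count the share of zeros.
    return np.array([1.0 - ((v >> b) & 1).mean() for b in range(n_bits)])


# Hypothetical 8-bit quantized weights, chosen only for illustration.
weights = np.array([0, 1, 2, 3, 4, 8, 16, 255], dtype=np.uint8)
sparsity = bit_plane_sparsity(weights)
overall = sparsity.mean()  # share of zero bits across all planes

for bit, s in enumerate(sparsity):
    print(f"bit {bit}: {s:.3f} zero")
print(f"overall: {overall:.3f} zero")
```

In a bit-sliced crossbar mapping, each bit plane would typically occupy its own cells, so planes with a high share of zeros are the ones where bit-level sparsity could in principle free up space.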

CPSAA: Accelerating Sparse Attention using Crossbar-based Processing-In-Memory Architecture

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Sanger: A Co-Design Framework for Enabling Sparse Attention Using Reconfigurable Architecture.

ERA-BS: Boosting the Efficiency of ReRAM-based PIM Accelerator with Fine-Grained Bit-Level Sparsity

S2-Attention: Hardware-Aware Context Sharding Among Attention Heads

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration

Watt: A Write-Optimized RRAM-Based Accelerator for Attention.

Multilayer Dataflow: Orchestrate Butterfly Sparsity to Accelerate Attention Computation

COSA: Co-Operative Systolic Arrays for Multi-head Attention Mechanism in Neural Network Using Hybrid Data Reuse and Fusion Methodologies.

High-Performance Method and Architecture for Attention Computation in DNN Inference

BafSP: Co-Design of Compute SRAM and Bit-Aware Data Flip Mitigation with In-Memory Sparsity Detection for SpMM

SoBS-X: Squeeze-Out Bit Sparsity for ReRAM-Crossbar-Based Neural Network Accelerator.

COSA Plus: Enhanced Co-Operative Systolic Arrays for Attention Mechanism in Transformers

StoX-Net: Stochastic Processing of Partial Sums for Efficient In-Memory Computing DNN Accelerators

Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity

VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations

HARDSEA: Hybrid Analog-ReRAM Clustering and Digital-SRAM In-Memory Computing Accelerator for Dynamic Sparse Self-Attention in Transformer