Abstract: Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerating parallel workloads such as Deep Learning (DL): it achieves massive data parallelism at low area overhead and saves orders of magnitude in data movement by placing computational resources closer to the data. While many PIM architectures have been proposed, improvements are still needed in communicating intermediate results to consumer kernels, in communication between tiles at scale, in reduction operations, and in efficiently performing bit-serial operations with constants. We present PIMSAB, a scalable architecture that provides a spatially aware communication network for efficient intra-tile and inter-tile data movement, along with efficient support for bit-serial compute patterns that are generally inefficient. The architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs), co-designed with a compiler to achieve high utilization. Its key novelties are (1) efficient support for spatially aware communication, through a local H-tree network for reductions, explicit hardware for shuffling operands, and systolic broadcasting, and (2) exploitation of the divisible nature of bit-serial computations through adaptive precision and efficient handling of constant operations. These innovations are integrated into a tensor-expression-based programming framework (including a compiler for easy programmability) that gives the programmer simple control over the optimizations used to map programs into massively parallel binaries for millions of PIM processing elements. Compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100) across common DL kernels and end-to-end DL networks (ResNet18 and BERT), PIMSAB outperforms the GPU by 4.80× and reduces energy by 3.76×. Compared with similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM), PIMSAB achieves speedups of 3.7× and 3.88×, respectively.
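
As a rough illustration of the bit-serial compute pattern the abstract refers to, the sketch below adds two vectors one bit position per step, with every lane processed in parallel at each step. It is a minimal software model under stated assumptions, not PIMSAB's hardware or compiler: the helper names (`to_bit_planes`, `from_bit_planes`, `bit_serial_add`) and the list-of-lists bit-plane layout are hypothetical and chosen only for readability.

```python
def to_bit_planes(values, bits):
    """Transpose integers into bit-planes: planes[i][lane] is bit i of values[lane]."""
    return [[(v >> i) & 1 for v in values] for i in range(bits)]


def from_bit_planes(planes):
    """Reassemble per-lane integers from a list of bit-planes (LSB first)."""
    lanes = len(planes[0])
    return [sum(planes[i][lane] << i for i in range(len(planes)))
            for lane in range(lanes)]


def bit_serial_add(a_planes, b_planes):
    """Add two operands stored as bit-planes, one bit position per step.

    Each step applies a full adder to one bit-plane across all lanes at once,
    so the step count equals the operand precision, not the number of lanes.
    """
    lanes = len(a_planes[0])
    carry = [0] * lanes
    out = []
    steps = 0
    for a_bits, b_bits in zip(a_planes, b_planes):
        # Sum bit of a full adder, computed for every lane.
        out.append([a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carry)])
        # Carry-out of a full adder, computed for every lane.
        carry = [(a & b) | (c & (a ^ b))
                 for a, b, c in zip(a_bits, b_bits, carry)]
        steps += 1
    out.append(carry)  # final carry-out becomes the extra top bit
    return out, steps


a = [3, 5, 7, 15]
b = [1, 2, 8, 15]

# 4-bit operands take 4 steps; the same values stored at 8 bits take 8 steps.
sum4, steps4 = bit_serial_add(to_bit_planes(a, 4), to_bit_planes(b, 4))
sum8, steps8 = bit_serial_add(to_bit_planes(a, 8), to_bit_planes(b, 8))
print(from_bit_planes(sum4), steps4)
print(from_bit_planes(sum8), steps8)
```

In this model the step count scales with the number of bits rather than the number of lanes, which is why choosing the smallest sufficient precision for an operand reduces the work of a bit-serial operation.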

A Collaborative PIM Computing Optimization Framework for Multi-Tenant DNN

A design framework for processing-in-memory accelerator

NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

Re2PIM

Dataflow-Aware PIM-Enabled Manycore Architecture for Deep Learning Workloads

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory

Static Scheduling of Weight Programming for DNN Acceleration with Resource Constrained PIM

A Practical Highly Paralleled ReRAM-Based DNN Accelerator by Reusing Weight Pattern Repetitions

PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory

SEAL-lab Technical Report – No. 2015-001 (April 29, 2016) Processing-in-Memory in ReRAM-based Main Memory

Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

SEAL-lab Technical Report – No. 2015-001 (November 30, 2015) Processing-in-Memory in ReRAM-based Main Memory

DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator

ReHy: A ReRAM-based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training

A Reconfigurable Computing-in-Memory Accelerator with Dynamic Group-Based Dataflow and Dual-Input Macro Designs

Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

Generalized Ping-Pong: Off-Chip Memory Bandwidth Centric Pipelining Strategy for Processing-In-Memory Accelerators

Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM

PQ-PIM: A Pruning–quantization Joint Optimization Framework for ReRAM-based Processing-in-memory DNN Accelerator