Abstract: Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures targeting parallel workloads such as Deep Learning (DL), because it achieves massive data parallelism at low area overhead and saves orders of magnitude in data movement by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements are needed in communicating intermediate results to consumer kernels, in communication between tiles at scale, in reduction operations, and in efficiently performing bit-serial operations with constants. We present PIMSAB, a scalable architecture that provides a spatially aware communication network for efficient intra-tile and inter-tile data movement, along with efficient support for compute patterns that are generally inefficient in bit-serial form. Our architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs) and is co-designed with a compiler to achieve high utilization. The key novelties of our architecture are: (1) efficient support for spatially aware communication through a local H-tree network for reductions, explicit hardware for shuffling operands, and systolic broadcasting, and (2) exploiting the divisible nature of bit-serial computations through adaptive precision, bit-slicing, and efficient handling of constant operations. Compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100) across common DL kernels and an end-to-end DL network (ResNet18), PIMSAB outperforms the GPU by 3x and reduces energy by 4.2x. Compared with similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM), PIMSAB achieves speedups of 3.7x and 3.88x, respectively.
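To make the bit-serial idea in the abstract concrete, the following is a minimal Python sketch of bit-serial addition over a transposed (bit-plane) data layout. It is an illustration only and not PIMSAB's implementation: the function names (`to_bit_planes`, `from_bit_planes`, `bit_serial_add`) and the use of a Python integer as a row of lanes are assumptions made for this example. Each loop iteration handles one bit position for all lanes at once, so the step count scales with operand precision, which is the property that adaptive precision would exploit.

```python
def to_bit_planes(values, precision):
    """Transpose values so plane i holds bit i of every value (lane j at bit j)."""
    planes = []
    for i in range(precision):
        plane = 0
        for lane, v in enumerate(values):
            plane |= ((v >> i) & 1) << lane
        planes.append(plane)
    return planes


def from_bit_planes(planes, num_lanes):
    """Inverse of to_bit_planes: rebuild one integer per lane."""
    values = [0] * num_lanes
    for i, plane in enumerate(planes):
        for lane in range(num_lanes):
            values[lane] |= ((plane >> lane) & 1) << i
    return values


def bit_serial_add(a_planes, b_planes):
    """Add two operands of equal precision, one bit position per step.

    Every bitwise operation below acts on all lanes simultaneously,
    mimicking a row-wide operation in a compute-enabled memory array.
    """
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)            # per-lane sum bit
        carry = (a & b) | (carry & (a ^ b))  # per-lane carry-out
    out.append(carry)                        # final carry is the extra result bit
    return out


a = to_bit_planes([3, 5, 7, 1], precision=3)
b = to_bit_planes([1, 2, 3, 4], precision=3)
sums = from_bit_planes(bit_serial_add(a, b), num_lanes=4)
```

In this sketch, adding 3-bit operands takes three steps regardless of how many lanes are present, and halving the precision would halve the number of steps.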

Improving overall parallelism in AES accelerator using BRAM and multiple input blocks

A design framework for processing-in-memory accelerator

BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs

M4BRAM: Mixed-Precision Matrix-Matrix Multiplication in FPGA Block RAMs

Compute RAMs: Adaptable Compute and Storage Blocks for DL-Optimized FPGAs

Parallel Implementation of AES on 2.5D Multicore Platform with Hardware and Software Co-Design

Design of a Low-Power Cryptographic Accelerator Under Advanced Encryption Standard

A Small-area Design of High Throughput AES Coprocessor

Improving FPGA-based Async-logic AES Accelerator with the Integration of Sync-logic Block RAMs

High Throughput Aes Encryption/Decryption With Efficient Reordering And Merging Techniques

Refine and Recycle: A Method to Increase Decompression Parallelism

High-throughput and area-efficient fully-pipelined hashing cores using BRAM in FPGA

Simultaneous Accelerator Parallelization and Point-to-point Interconnect Insertion for Bus-Based Embedded SoCs

Multiscale Co-Design Analysis of Energy, Latency, Area, and Accuracy of a ReRAM Analog Neural Training Accelerator

An Effective Test Method for Block RAMs in Heterogeneous FPGAs Based on a Novel Partial Bitstream Relocation Technique

A Low-Latency DNN Accelerator Enabled by DFT-Based Convolution Execution Within Crossbar Arrays

A Low Area High Speed FPGA Implementation of AES Architecture for Cryptography Application

Adaptive design and implementation of automatic modulation recognition accelerator

A Design of a Fast Parallel-Pipelined Implementation of AES: Advanced Encryption Standard

A multithread AES accelerator for Cyber-Physical Systems

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation