Abstract: Recent advances in Artificial Intelligence (AI) that achieve “better-than-human” accuracy in tasks such as image classification and the game of Go have come at the cost of an exponential increase in the size of artificial neural networks. This has led to AI hardware solutions becoming severely memory-bound and scrambling to keep up with the ever-widening “von Neumann bottleneck”. Processing-in-Memory (PiM) architectures ease the von Neumann bottleneck by embedding compute capabilities inside the memory and reducing the data traffic between the memory and the processor. However, PiM accelerators break the standard von Neumann programming model by fusing memory and compute operations together, which impedes their integration into the standard computing stack. There is an urgent need for system-level solutions that take full advantage of PiM accelerators for end-to-end acceleration of AI applications. This article presents AI-PiM as a solution to bridge this research gap. AI-PiM proposes a hardware, ISA and software co-design methodology which allows integration of PiM accelerators into the RISC-V processor pipeline as functional execution units. AI-PiM also extends the RISC-V ISA with custom instructions which directly target the PiM functional units, resulting in their tight integration with the processor. This tight integration is especially important for edge AI devices, which need to process both AI and non-AI tasks on the same hardware due to area, power, size and cost constraints. The AI-PiM ISA extensions expose the PiM hardware functionality to software programmers, allowing efficient mapping of applications to the PiM hardware. AI-PiM adds support for the custom ISA extensions to the complete software stack, including the compiler, assembler, linker, simulator and profiler, to ensure programmability and evaluation with popular AI domain-specific languages and frameworks such as TensorFlow, PyTorch, MXNet and Keras.
AI-PiM improves the performance of the vector-matrix multiplication (VMM) kernel by 17.63x and provides a mean speed-up of 2.74x on the MLPerf Tiny benchmark compared to an RV64IMC RISC-V baseline. AI-PiM also speeds up MLPerf Tiny benchmark inference, measured in cycles, by 2.45x on average compared to the state-of-the-art Arm Cortex-A72 processor.
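For readers unfamiliar with the VMM kernel named above, the following is a minimal reference sketch of the operation in plain Python. It only illustrates the arithmetic that a PiM functional unit would be asked to perform; it is not AI-PiM's implementation, and the function name `vmm` and the row-major list-of-lists layout are assumptions made for illustration.

```python
def vmm(vector, matrix):
    """Reference vector-matrix multiply: out[j] = sum_i vector[i] * matrix[i][j].

    Hypothetical helper for illustration only; `matrix` is assumed to be a
    row-major list of equally sized rows.
    """
    rows = len(matrix)
    if rows != len(vector):
        raise ValueError("vector length must equal the number of matrix rows")
    cols = len(matrix[0]) if rows else 0
    out = [0] * cols
    # Each input element scales one matrix row; the scaled rows are accumulated.
    for i, x in enumerate(vector):
        row = matrix[i]
        if len(row) != cols:
            raise ValueError("all matrix rows must have the same length")
        for j in range(cols):
            out[j] += x * row[j]
    return out


if __name__ == "__main__":
    # A 2-element input vector times a 2x3 weight matrix gives 3 outputs.
    print(vmm([1, 2], [[1, 2, 3], [4, 5, 6]]))
```

The inner multiply-accumulate loop dominates the work in dense neural-network layers, which is why a dedicated functional unit for it is a natural target for acceleration.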

Instruction Set Architecture (ISA) for Processing-in-Memory DNN Accelerators

PIMSIM-NN: An ISA-based Simulation Framework for Processing-in-Memory Accelerators

A design framework for processing-in-memory accelerator

PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators

Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator

NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators

PIMulator-NN: an Event-Driven, Cross-level Simulation Framework for Processing-In-Memory Based Neural Network Accelerators

AI-PiM—Extending the RISC-V processor with Processing-in-Memory functional units for AI inference at the edge of IoT

SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration

pPIM: A Programmable Processor-in-Memory Architecture With Precision-Scaling for Deep Learning

Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM

Generalized Ping-Pong: Off-Chip Memory Bandwidth Centric Pipelining Strategy for Processing-In-Memory Accelerators

Functionality-Based Processing-in-Memory Accelerator for Deep Convolutional Neural Networks

Heterogeneous Memory Architecture Accommodating Processing-in-Memory on SoC for AIoT Applications

PIMCoSim: Hardware/Software Co-Simulator for Exploring Processing-in-Memory Architectures

Accelerating Deep Neural Networks in Processing-in-Memory Platforms: Analog or Digital Approach?