Abstract: Over the past few years, on-device learning (ODL) has become integral to the success of edge devices that use machine learning (ML), since it plays a crucial role in restoring ML model accuracy when the edge environment changes. However, implementing ODL on battery-limited edge devices is challenging: ML training generates large intermediate data and requires frequent data movement between the processor and memory, resulting in substantial power consumption. To address this limitation, some ML accelerators in edge devices have adopted a processing-in-memory (PIM) paradigm, which integrates computing logic into memory. Nevertheless, these accelerators still suffer from long latency caused by the lack of pipelining in the training process, notable power and area overheads from floating-point arithmetic, and incomplete handling of data sparsity during training. This article presents a high-throughput super-pipelined PIM accelerator, named SP-PIM, designed to overcome these limitations of existing PIM-based ODL accelerators. To this end, SP-PIM implements a holistic multi-level pipelining scheme based on local error prediction (EP), enhancing training speed by 7.31$\times$. In addition, SP-PIM introduces a local EP unit (LEPU), a lightweight circuit that performs accurate EP using power-of-two (PoT) random weights. This strategy reduces power-hungry external memory access (EMA) by 59.09%. Moreover, SP-PIM fully exploits sparsity in both activation and error data during training, facilitated by a highly optimized PIM macro design. Finally, the SP-PIM chip, fabricated in 28-nm CMOS technology, achieves a training speed of 8.81 epochs/s. It occupies a die area of 5.76 mm$^{2}$ and consumes between 6.91 and 433.25 mW at operating frequencies of 20–450 MHz with a supply voltage of 0.56–1.05 V.
We demonstrate that the chip can successfully execute end-to-end ODL for the CIFAR10 and CIFAR100 datasets. It achieves state-of-the-art area efficiency (560.6 GFLOPS/mm$^{2}$) and competitive power efficiency (22.4 TFLOPS/W), marking a 3.95$\times$ higher figure-of-merit (area efficiency $\times$ power efficiency $\times$ capacity) than previous work. Furthermore, we implemented a cycle-level simulator in Python to investigate and validate the scalability of SP-PIM. Through architectural experiments across various hardware configurations, we verified that the core computing unit within SP-PIM supports both scale-up and scale-out.
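The abstract describes the LEPU as lightweight because it uses power-of-two (PoT) random weights. One general reason PoT weights are cheap is that multiplying an integer by $\pm 2^{k}$ reduces to a bit shift and an optional negation, so a dot product needs no multiplier. The sketch below illustrates only that arithmetic property; it is not the SP-PIM circuit or its EP algorithm. The function names, the `(sign, shift)` weight encoding, and the `max_shift` range are hypothetical choices made for illustration. It also skips zero operands, as a simple software analogue of sparsity exploitation.

```python
import random


def make_pot_weights(n, max_shift=3, seed=0):
    """Draw n random weights of the form sign * 2**shift (hypothetical encoding)."""
    rng = random.Random(seed)
    return [(rng.choice((-1, 1)), rng.randint(0, max_shift)) for _ in range(n)]


def pot_dot(values, pot_weights):
    """Dot product of integers with PoT weights using only shifts and adds."""
    acc = 0
    for v, (sign, shift) in zip(values, pot_weights):
        if v == 0:
            continue  # a zero operand contributes nothing, so skip it
        term = v << shift  # equals v * 2**shift for Python integers
        acc += term if sign > 0 else -term
    return acc


def reference_dot(values, pot_weights):
    """Same dot product computed with ordinary multiplications, for checking."""
    return sum(v * sign * (2 ** shift) for v, (sign, shift) in zip(values, pot_weights))


if __name__ == "__main__":
    values = [3, -5, 7, 0, 2]
    weights = make_pot_weights(len(values))
    print(pot_dot(values, weights) == reference_dot(values, weights))
```

The shift-based and multiply-based results agree for any integer inputs, which is the property that lets a hardware design replace multipliers with shifters when weights are restricted to powers of two.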

pPIM: A Programmable Processor-in-Memory Architecture With Precision-Scaling for Deep Learning

A design framework for processing-in-memory accelerator

ReApprox-PIM: Reconfigurable Approximate Look-Up-Table (LUT)-Based Processing-in-Memory (PIM) Machine Learning Accelerator

Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

ConvPIM: Evaluating Digital Processing-in-Memory through Convolutional Neural Network Acceleration

PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators

PIM-AI: A Novel Architecture for High-Efficiency LLM Inference

SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator With Local Error Prediction for Area/Energy-Efficient On-Device Learning

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator

Generalized Ping-Pong: Off-Chip Memory Bandwidth Centric Pipelining Strategy for Processing-In-Memory Accelerators

Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM

Functionality-Based Processing-in-Memory Accelerator for Deep Convolutional Neural Networks

PIMulator-NN: an Event-Driven, Cross-level Simulation Framework for Processing-In-Memory Based Neural Network Accelerators

CMP-PIM: An Energy-Efficient Comparator-based Processing-In-Memory Neural Network Accelerator

ReHy: A ReRAM-based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training

VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations

NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

SEAL-lab Technical Report – No. 2015-001 (April 29, 2016) Processing-in-Memory in ReRAM-based Main Memory

Accelerating Neural Network Training with Processing-in-Memory GPU