Abstract: Processing-in-memory (PIM) architectures are emerging to reduce data movement in data-intensive applications. These architectures seek to exploit the same physical devices for both information storage and logic, thereby dwarfing the required data transfer and utilizing the full internal memory bandwidth. Whereas analog PIM utilizes the inherent connectivity of crossbar arrays for approximate matrix-vector multiplication in the analog domain, digital PIM architectures enable bitwise logic operations with massive parallelism across columns of data within memory arrays. Several recent works have extended the computational capabilities of digital PIM architectures towards the full-precision (single-precision floating-point) acceleration of convolutional neural networks (CNNs); yet, they lack a comprehensive comparison to GPUs. In this paper, we examine the potential of digital PIM for CNN acceleration through an updated quantitative comparison with GPUs, supplemented with an analysis of the overall limitations of digital PIM. We begin by investigating the different PIM architectures from a theoretical perspective to understand the underlying performance limitations and improvements compared to state-of-the-art hardware. We then uncover the tradeoffs between the different strategies through a series of benchmarks ranging from memory-bound vectored arithmetic to CNN acceleration. We conclude with insights into the general performance of digital PIM architectures for different data-intensive applications.

What problem does this paper attempt to address?

This paper examines how digital processing-in-memory (digital PIM) architectures perform, and where they fall short, relative to state-of-the-art hardware such as GPUs when accelerating convolutional neural networks (CNNs). Specifically, the paper aims to:

1. **Evaluate the potential of digital PIM in CNN acceleration**: Through an updated quantitative comparison, measure the performance of digital PIM in CNN acceleration against that of GPUs.
2. **Analyze the limitations of digital PIM**: Analyze the performance limitations and improvements of different PIM architectures from both theoretical and experimental perspectives, in particular their advantages and disadvantages compared with existing hardware.
3. **Provide a comprehensive performance evaluation**: Through a series of benchmarks, evaluate digital PIM on workloads ranging from basic vector arithmetic to the inference and training of large-scale CNN models.

### Main research contents

- **Theoretical analysis**: Explore the theoretical performance limitations and improvements of different PIM architectures and compare them with existing hardware such as GPUs.
- **Experimental verification**: Verify the actual performance of digital PIM through a series of benchmarks, including memory-intensive vector arithmetic, matrix multiplication, 2D convolution, and complete CNN inference and training.
- **Performance indicators**: Develop multiple performance metrics to better understand how the digital PIM architecture behaves across different data-intensive applications.

### Key findings

- **High computational complexity**: Floating-point operations have high computational complexity in digital PIM, which limits the performance improvement in some tasks.
- **High data reuse rate**: The high data reuse in CNNs lets GPUs perform well on these tasks, so the advantage of digital PIM is not pronounced.
- **Memory wall bottleneck**: In tasks with low data reuse, digital PIM can significantly reduce memory access latency; in tasks with high data reuse, this advantage weakens.

### Conclusion

Although digital PIM performs well in some specific tasks, for full-precision CNN acceleration the digital PIM architecture under the current parameters still cannot surpass GPU performance. Future research can focus on applications with low computational complexity or low data reuse, where the advantages of digital PIM can be fully exploited.
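The data-reuse argument in the key findings can be made concrete with a rough arithmetic-intensity estimate (floating-point operations per byte of compulsory memory traffic). The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, operands are assumed to be single-precision (4 bytes), and each operand is assumed to be read or written exactly once (ideal caching). Under those assumptions, element-wise vector addition has a fixed, low intensity, while dense matrix multiplication reuses each loaded value many times.

```python
"""Rough arithmetic-intensity estimates (FLOPs per byte moved).

Assumptions (not from the paper): fp32 operands, and every input/output
element crosses the memory interface exactly once.
"""

BYTES_PER_FP32 = 4


def vector_add_intensity(n: int) -> float:
    """c = a + b on n-element vectors: n additions, 3n elements moved."""
    flops = n
    bytes_moved = 3 * n * BYTES_PER_FP32  # read a, read b, write c
    return flops / bytes_moved


def matmul_intensity(n: int) -> float:
    """C = A @ B on n x n matrices: 2n^3 FLOPs, 3n^2 elements moved."""
    flops = 2 * n**3  # n^3 multiplications plus n^3 additions
    bytes_moved = 3 * n * n * BYTES_PER_FP32  # read A, read B, write C
    return flops / bytes_moved


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(
            f"n={n}: vector add {vector_add_intensity(n):.3f} FLOP/B, "
            f"matmul {matmul_intensity(n):.1f} FLOP/B"
        )
```

Vector addition stays at 1/12 FLOP per byte regardless of size, so it is bound by memory bandwidth, which is the regime where in-memory computation helps most. Matrix multiplication, the core of CNN layers, grows as n/6 FLOP per byte, so a processor with caches can amortize each memory access over many operations.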

ConvPIM: Evaluating Digital Processing-in-Memory through Convolutional Neural Network Acceleration

A design framework for processing-in-memory accelerator

Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

GIM: Versatile GNN Acceleration with Reconfigurable Processing-in-Memory

pPIM: A Programmable Processor-in-Memory Architecture With Precision-Scaling for Deep Learning

ReHy: A ReRAM-based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training

Functionality-Based Processing-in-Memory Accelerator for Deep Convolutional Neural Networks

CMP-PIM: An Energy-Efficient Comparator-based Processing-In-Memory Neural Network Accelerator

Runtime Support for Accelerating CNN Models on Digital DRAM Processing-in-Memory Hardware

ReApprox-PIM: Reconfigurable Approximate Look-Up-Table (LUT)-Based Processing-in-Memory (PIM) Machine Learning Accelerator

An Efficient Racetrack Memory-Based Processing-in-Memory Architecture for Convolutional Neural Networks

Accelerating Deep Neural Networks in Processing-in-Memory Platforms: Analog or Digital Approach?

An Energy-Efficient Quantized and Regularized Training Framework for Processing-In-Memory Accelerators

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations

Ultra-High-Speed Accelerator Architecture for Convolutional Neural Network Based on Processing-in-Memory Using Resistive Random Access Memory

Accelerating Neural Network Training with Processing-in-Memory GPU

AritPIM: High-Throughput In-Memory Arithmetic

DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator

Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud