Abstract: Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such proposals are reasonable given the growing importance of ML, as memory is a pervasive component, there is a case for a more inclusive PIM design that can accelerate primitives across domains. In this work, we ascertain the capabilities of commercial PIM proposals to accelerate various primitives across domains. We begin by outlining a set of characteristics, termed PIM-amenability-test, which aid in assessing whether a given primitive is likely to be accelerated by PIM. Next, we apply this test to the primitives under study to ascertain efficient data placement and orchestration for mapping the primitives to the underlying PIM architecture. We observe that, even though the primitives under study are largely PIM-amenable, existing commercial PIM proposals do not realize their performance potential for these primitives. To address this, we identify bottlenecks that arise in PIM execution and propose hardware and software optimizations which stand to broaden the acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we believe emerging commercial PIM proposals add a necessary and complementary design point in the application acceleration space, hardware-software co-design is necessary to deliver their benefits broadly.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to evaluate and improve the ability of commercial Processing-in-Memory (PIM) architectures to accelerate basic operations (primitives) across domains. Specifically, it addresses the following key problems:

1. **Limitations of existing PIM designs**: Most current commercial PIM designs are optimized mainly for specific machine learning (ML) operations, especially dense matrix-vector multiplication. These designs do not fully consider the operational requirements of other domains, which limits their application range.
2. **Acceleration potential of cross-domain operations**: The paper explores how PIM designs can accelerate basic operations in different domains, including scientific computing, machine learning, and graph analysis. This covers not only existing ML operations but also other operations that may be limited by memory bandwidth.
3. **Co-design of hardware and software**: The paper emphasizes the importance of co-designing hardware and software to fully exploit the PIM architecture. By identifying the bottlenecks in existing PIM execution and proposing hardware enhancements and software optimizations, it shows how to significantly improve PIM speedups.
4. **Data placement and computation orchestration**: The paper proposes a "PIM-amenability-test" to evaluate whether an operation is suitable for acceleration by PIM and to guide programmers in mapping the operation efficiently to the PIM architecture. The test helps determine data placement and computation orchestration strategies that maximize the performance improvement from PIM.
5. **Performance modeling and optimization**: Through performance modeling, the paper finds that even with carefully arranged data placement and computation orchestration, existing commercial PIM designs still fail to fully realize their performance potential. The paper therefore analyzes the bottlenecks in PIM execution and proposes targeted optimizations, such as architecture-aware scheduling, sparsity-aware orchestration, and cache-aware offloading.

### Main contributions

- **Cross-domain evaluation**: This is the first comprehensive evaluation of emerging commercial PIM designs, covering basic operations from multiple domains.
- **PIM-amenability-test**: The paper develops a PIM-amenability-test to help programmers evaluate whether an operation is suitable for acceleration by PIM and to guide efficient mapping strategies.
- **Bottleneck identification and optimization**: The paper identifies the bottlenecks in existing PIM systems and proposes hardware enhancements and software optimizations that significantly improve the acceleration reach and performance of PIM.
- **Inclusive design**: The paper advocates a more inclusive PIM design that can broadly accelerate non-ML operations while still giving priority to mainstream ML operations.

Through these efforts, the paper shows how hardware-software co-design can make the PIM architecture applicable to a wider range of high-performance computing scenarios, thereby improving its practicality and performance.
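The summary notes that PIM targets operations that may be limited by memory bandwidth. As a rough illustration only, the sketch below applies a roofline-style check (arithmetic intensity compared against a machine's compute-to-bandwidth ratio) to two familiar primitives. This is not the paper's PIM-amenability-test, whose actual characteristics are not listed here; the `Primitive` type, the helper names, and the peak throughput and bandwidth figures are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    """Hypothetical description of a primitive's work and memory traffic."""
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to memory


def arithmetic_intensity(p: Primitive) -> float:
    """FLOPs performed per byte of memory traffic."""
    return p.flops / p.bytes_moved


def is_bandwidth_bound(p: Primitive, peak_flops: float, peak_bandwidth: float) -> bool:
    """Roofline-style check: a primitive is bandwidth-bound when its
    arithmetic intensity falls below the machine's ridge point."""
    ridge_point = peak_flops / peak_bandwidth  # FLOPs per byte
    return arithmetic_intensity(p) < ridge_point


def gemv(m: int, n: int, elem_bytes: int = 4) -> Primitive:
    """Dense matrix-vector product y = A x, assuming each operand
    is read or written exactly once."""
    return Primitive("gemv", 2.0 * m * n, elem_bytes * (m * n + n + m))


def gemm(n: int, elem_bytes: int = 4) -> Primitive:
    """Square dense matrix-matrix product C = A B, assuming ideal reuse
    (each of the three n x n matrices touched once)."""
    return Primitive("gemm", 2.0 * n ** 3, elem_bytes * 3 * n * n)


# Assumed machine: 10 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 10e12
PEAK_BANDWIDTH = 1e12

for prim in (gemv(4096, 4096), gemm(4096)):
    verdict = "bandwidth-bound" if is_bandwidth_bound(prim, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute-bound"
    print(f"{prim.name}: {arithmetic_intensity(prim):.2f} FLOP/byte, {verdict}")
```

Under these assumed numbers, matrix-vector multiplication performs well under one FLOP per byte and lands on the bandwidth-bound side, which is consistent with the summary's remark that commercial PIM designs focus on this kind of operation. A real amenability assessment would also have to account for data placement and orchestration, which this sketch ignores.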

Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures

A design framework for processing-in-memory accelerator

PIM-DH: ReRAM-based Processing-in-Memory Architecture for Deep Hashing Acceleration

Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM

Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product

PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators

Methodologies, Workloads, and Tools for Processing-in-Memory: Enabling the Adoption of Data-Centric Architectures

UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems

Balanced Data Placement for GEMV Acceleration with Processing-In-Memory

Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

The BRAM is the Limit: Shattering Myths, Shaping Standards, and Building Scalable PIM Accelerators

SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory

PyPIM: Integrating Digital Processing-in-Memory from Microarchitectural Design to Python Tensors

PIM-AI: A Novel Architecture for High-Efficiency LLM Inference

DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator

Generalized Ping-Pong: Off-Chip Memory Bandwidth Centric Pipelining Strategy for Processing-In-Memory Accelerators

Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

PIMCoSim: Hardware/Software Co-Simulator for Exploring Processing-in-Memory Architectures

IMAGine: An In-Memory Accelerated GEMV Engine Overlay