Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures

Johnathan Alsop, Shaizeen Aga, Mohamed Ibrahim, Mahzabeen Islam, Andrew Mccrabb, Nuwan Jayasena
DOI: https://doi.org/10.48550/arXiv.2309.07984
2024-01-18
Abstract: Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such proposals are reasonable given the growing importance of ML, as memory is a pervasive component, there is a case for a more inclusive PIM design that can accelerate primitives across domains. In this work, we ascertain the capabilities of commercial PIM proposals to accelerate various primitives across domains. We first begin with outlining a set of characteristics, termed PIM-amenability-test, which aid in assessing if a given primitive is likely to be accelerated by PIM. Next, we apply this test to primitives under study to ascertain efficient data-placement and orchestration to map the primitives to underlying PIM architecture. We observe here that, even though primitives under study are largely PIM-amenable, existing commercial PIM proposals do not realize their performance potential for these primitives. To address this, we identify bottlenecks that arise in PIM execution and propose hardware and software optimizations which stand to broaden the acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we believe emerging commercial PIM proposals add a necessary and complementary design point in the application acceleration space, hardware-software co-design is necessary to deliver their benefits broadly.
Hardware Architecture
### What problems does this paper attempt to solve?

This paper evaluates and seeks to improve the ability of commercial processing-in-memory (PIM) architectures to accelerate basic operations (primitives) across domains. Specifically, it addresses the following key problems:

1. **Limitations of existing PIM designs**: Most current commercial PIM designs are optimized mainly for a narrow set of machine learning (ML) operations, especially dense matrix-vector multiplication. They do not fully consider the requirements of other domains, which limits their range of application.
2. **Acceleration potential of cross-domain operations**: The paper explores how PIM designs can accelerate primitives from different domains, including scientific computing, machine learning, and graph analysis. The scope extends beyond the ML operations targeted today to other operations that may be limited by memory bandwidth.
3. **Co-design of hardware and software**: The paper emphasizes that hardware and software must be co-designed to fully exploit the PIM architecture. It identifies the bottlenecks in existing PIM execution and proposes hardware enhancements and software optimizations that improve average PIM speedup from 1.12x to 2.49x relative to a GPU baseline.
4. **Data placement and computation orchestration**: The paper proposes a set of characteristics, termed the "PIM-amenability-test", for assessing whether an operation is likely to be accelerated by PIM and for guiding programmers in mapping the operation efficiently to the PIM architecture. The test helps determine data placement and computation orchestration strategies.
5. **Performance modeling and optimization**: Through performance modeling, the paper finds that even with carefully arranged data placement and computation orchestration, existing commercial PIM designs fail to realize their full performance potential. It therefore analyzes the bottlenecks in PIM execution and proposes targeted optimizations such as architecture-aware scheduling, sparsity-aware orchestration, and cache-aware offloading.

### Main contributions

- **Cross-domain evaluation**: Presents the first comprehensive evaluation of emerging commercial PIM designs, covering primitives from multiple domains.
- **PIM-amenability-test**: Develops the PIM-amenability-test to help programmers evaluate whether an operation is suitable for PIM acceleration and to guide efficient mapping strategies.
- **Bottleneck identification and optimization**: Identifies the bottlenecks in existing PIM systems and proposes hardware enhancements and software optimizations that broaden the acceleration reach and improve the performance of PIM.
- **Inclusive design**: Advocates a more inclusive PIM design that can broadly accelerate non-ML operations while still prioritizing mainstream ML operations.

Together, these contributions show how hardware-software co-design can make commercial PIM architectures applicable to a wider range of high-performance computing workloads, improving both their practicality and their performance.
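The summary above notes that PIM targets operations limited by memory bandwidth. As a rough illustration of that one property, the sketch below applies a standard roofline-style check: a primitive is treated as bandwidth-bound when its arithmetic intensity (operations per byte moved) falls below the machine balance. This is not the paper's PIM-amenability-test, whose characteristics are not listed here. The `Primitive` class, the function names, and the peak throughput and bandwidth figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    """Hypothetical description of a compute primitive (not from the paper)."""
    name: str
    flops: float        # total arithmetic operations
    bytes_moved: float  # total bytes read from and written to memory


def arithmetic_intensity(p: Primitive) -> float:
    """Operations performed per byte of memory traffic."""
    return p.flops / p.bytes_moved


def is_bandwidth_bound(p: Primitive, peak_flops: float, peak_bandwidth: float) -> bool:
    """Roofline-style check: bandwidth-bound when arithmetic intensity
    is below the machine balance (peak FLOP/s divided by peak bytes/s)."""
    machine_balance = peak_flops / peak_bandwidth
    return arithmetic_intensity(p) < machine_balance


# Illustrative numbers only; they do not describe any real GPU or PIM device.
PEAK_FLOPS = 20e12      # assumed 20 TFLOP/s
PEAK_BANDWIDTH = 1e12   # assumed 1 TB/s

n = 4096
# Dense matrix-vector multiply in fp32: 2*n*n operations; traffic is the
# n x n matrix plus the input and output vectors, 4 bytes per element.
gemv = Primitive("gemv", flops=2 * n * n, bytes_moved=4 * (n * n + 2 * n))
# Dense matrix-matrix multiply in fp32: 2*n^3 operations; three n x n
# matrices touched once each (an idealized, perfectly cached estimate).
gemm = Primitive("gemm", flops=2 * n**3, bytes_moved=4 * 3 * n * n)

for p in (gemv, gemm):
    bound = is_bandwidth_bound(p, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"{p.name}: intensity={arithmetic_intensity(p):.2f} ops/byte, "
          f"bandwidth-bound={bound}")
```

Under these assumed figures, matrix-vector multiplication falls well below the machine balance while matrix-matrix multiplication sits far above it, which is consistent with the summary's observation that commercial PIM designs target dense matrix-vector multiplication. A fuller amenability check would also need to consider data placement and orchestration, which this sketch ignores.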