Abstract: Near-Memory Processing (NMP) systems that integrate accelerators within DIMM (Dual-Inline Memory Module) buffer chips can potentially provide high performance at relatively low design and manufacturing cost. However, an inevitable communication bottleneck arises on the main memory bus that connects peer DIMMs and the host CPU. This bottleneck is rooted in the bus-based nature and the limited point-to-point communication pattern of the main memory system. The aggregated memory bandwidth of DIMM-based NMP scales with the number of DIMMs. When the number of DIMMs in a channel scales up, however, the per-DIMM point-to-point communication bandwidth scales down, whereas the computation resources and local memory bandwidth per DIMM stay the same. For many important sparse, data-intensive workloads such as graph applications and sparse tensor algebra, we identify that communication among DIMMs and the host CPU easily dominates processing in previous DIMM-based NMP systems, which severely bottlenecks their performance.

To tackle this challenge, we propose that inter-DIMM broadcast should be implemented and utilized in the main memory system of DIMM-based NMP. On the hardware side, the main memory bus naturally scales out with broadcast: the per-DIMM effective bandwidth of broadcast remains the same as the number of DIMMs grows. On the software side, many sparse applications can be implemented in a form such that broadcasts dominate their communication. Based on these ideas, we design ABC-DIMM, which Alleviates the Bottleneck of Communication in DIMM-based NMP. It consists of integral broadcast mechanisms and a Broadcast-Process programming framework, with minimal modifications to the commodity software-hardware stack. Our evaluation shows that ABC-DIMM offers an 8.33× geo-mean speedup over a 16-core CPU baseline, and outperforms two NMP baselines by 2.59× and 2.93× on average.
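The scaling argument in the abstract can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes an idealized shared channel whose bandwidth is split evenly among DIMMs for point-to-point transfers, while a broadcast is received by every DIMM at the full channel rate. The function name `per_dimm_bandwidth` and the 19.2 GB/s channel figure are illustrative assumptions.

```python
def per_dimm_bandwidth(channel_bw: float, num_dimms: int, broadcast: bool) -> float:
    """Effective communication bandwidth seen by each DIMM on one channel.

    Toy model (an assumption, not the paper's evaluation model):
    - point-to-point: the shared bus is time-multiplexed evenly across DIMMs;
    - broadcast: one bus transfer is received by all DIMMs at once.
    """
    if num_dimms < 1:
        raise ValueError("num_dimms must be at least 1")
    if broadcast:
        # Every DIMM observes the same transfer, so nothing is divided.
        return channel_bw
    # Each DIMM gets an equal share of the bus.
    return channel_bw / num_dimms


if __name__ == "__main__":
    channel_bw = 19.2  # GB/s, hypothetical channel bandwidth for illustration
    for n in (1, 2, 4, 8):
        p2p = per_dimm_bandwidth(channel_bw, n, broadcast=False)
        bc = per_dimm_bandwidth(channel_bw, n, broadcast=True)
        print(f"{n} DIMMs: point-to-point {p2p:.2f} GB/s, broadcast {bc:.2f} GB/s")
```

Under these assumptions the point-to-point share shrinks in proportion to the DIMM count while the broadcast figure stays flat, which is the asymmetry the abstract relies on. Real buses add arbitration, turnaround, and protocol overheads that this sketch ignores.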

ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-based Near-Memory Processing with Inter-DIMM Broadcast
