Abstract: Processing-In-Memory (PIM) is an effective technique that reduces data movement by integrating processing units within memory. Recent advances in "big data" and 3D stacking technology make PIM a practical and viable solution for modern data processing workloads, as exemplified by recent research interest in PIM-based acceleration. Among these efforts, TESSERACT is a PIM-enabled parallel graph processing architecture based on Micron's Hybrid Memory Cube (HMC), one of the most prominent 3D-stacked memory technologies. It implements a Pregel-like vertex-centric programming model, so that users can develop programs in a familiar interface while taking advantage of PIM. Despite its orders-of-magnitude speedup over DRAM-based systems, TESSERACT generates excessive cross-cube communication through SerDes links, whose bandwidth is much lower than the aggregate local bandwidth of the HMCs. Our investigation indicates that this is caused by the restricted data organization required by the vertex programming model. In this paper, we argue that a PIM-based graph processing system should take data organization as a first-order design consideration. Following this principle, we propose GRAPHP, a novel HMC-based software/hardware co-designed graph processing system that drastically reduces communication and energy consumption compared to TESSERACT. GRAPHP features three key techniques. 1) "Source-cut" partitioning, which fundamentally changes cross-cube communication from one remote put per cross-cube edge to one update per replica. 2) "Two-phase Vertex Program", a programming model designed for "source-cut" partitioning with two operations: GenUpdate and ApplyUpdate. 3) Hierarchical communication and overlapping, which further improve performance by exploiting unique opportunities offered by the proposed partitioning and programming model. We evaluate GRAPHP using a cycle-accurate simulator with 5 real-world graphs and 4 algorithms. The results show that it provides on average a 1.7x speedup and 89% energy saving compared to TESSERACT.
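The shift from one remote put per cross-cube edge to one update per replica can be illustrated with a small sketch. This is not GRAPHP's implementation. It assumes that source-cut stores every edge in the cube that owns its destination vertex and keeps a replica of each remote source vertex there, which the abstract does not spell out. The function names `count_cross_edges`, `partition`, `gen_update`, and `run_iteration`, the in-neighbor sum used as the vertex program, and the toy graph are all hypothetical.

```python
from collections import defaultdict


def count_cross_edges(edges, owner):
    """Remote puts in a per-edge scheme: one per edge whose endpoints
    live in different cubes."""
    return sum(1 for src, dst in edges if owner[src] != owner[dst])


def partition(edges, owner):
    """Assumed source-cut layout: each edge is stored with its destination;
    a remote source vertex gets one replica in that cube."""
    in_edges = defaultdict(list)   # cube -> [(src, dst), ...]
    replicas = defaultdict(set)    # vertex -> cubes holding a replica of it
    for src, dst in edges:
        cube = owner[dst]
        in_edges[cube].append((src, dst))
        if owner[src] != cube:
            replicas[src].add(cube)
    return in_edges, replicas


def gen_update(cube_edges, local_vals):
    """Phase 1 (hypothetical GenUpdate): purely local work. Each destination
    accumulates over its in-edges using the cube's local copies."""
    acc = defaultdict(float)
    for src, dst in cube_edges:
        acc[dst] += local_vals[src]
    return acc


def run_iteration(edges, owner, values):
    """One iteration; returns the new values and the cross-cube message count."""
    in_edges, replicas = partition(edges, owner)
    cubes = set(owner.values())

    # Each cube sees its own (master) vertices plus the replicas it holds.
    local = {c: {} for c in cubes}
    for v, c in owner.items():
        local[c][v] = values[v]
    for v, holders in replicas.items():
        for c in holders:
            local[c][v] = values[v]

    # Phase 1: generate updates without any cross-cube traffic.
    new_values = dict(values)
    for c in cubes:
        new_values.update(gen_update(in_edges[c], local[c]))

    # Phase 2 (hypothetical ApplyUpdate): the master applies its new value
    # and sends exactly one message to each cube holding a replica.
    messages = 0
    for v, val in new_values.items():
        local[owner[v]][v] = val
        for c in replicas.get(v, ()):
            local[c][v] = val
            messages += 1
    return new_values, messages


if __name__ == "__main__":
    edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 0)]
    owner = {0: 0, 1: 0, 2: 1, 3: 1}   # vertex -> cube
    values = {v: 1.0 for v in owner}
    print(count_cross_edges(edges, owner))
    print(run_iteration(edges, owner, values))
```

On this toy graph all five edges cross cubes, so a per-edge scheme would issue five remote puts, while the replica-based scheme sends three updates, one for each (vertex, remote cube) pair. The gap widens as more edges from the same source vertex point into the same remote cube.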
