Abstract: Processing-In-Memory (PIM) is an effective technique that reduces data movement by integrating processing units within memory. Recent advances in "big data" and 3D stacking technology make PIM a practical and viable solution for modern data processing workloads, as exemplified by the recent research interest in PIM-based acceleration. Among these efforts, TESSERACT is a PIM-enabled parallel graph processing architecture based on Micron's Hybrid Memory Cube (HMC), one of the most prominent 3D-stacked memory technologies. It implements a Pregel-like vertex-centric programming model, so that users can develop programs with a familiar interface while taking advantage of PIM. Despite its orders-of-magnitude speedup over DRAM-based systems, TESSERACT generates excessive cross-cube communication through SerDes links, whose bandwidth is much lower than the aggregate local bandwidth of the HMCs. Our investigation indicates that this is caused by the restricted data organization required by the vertex programming model. In this paper, we argue that a PIM-based graph processing system should take data organization as a first-order design consideration. Following this principle, we propose GRAPHP, a novel HMC-based software/hardware co-designed graph processing system that drastically reduces communication and energy consumption compared to TESSERACT. GRAPHP features three key techniques: 1) "Source-cut" partitioning, which fundamentally changes the cross-cube communication from one remote put per cross-cube edge to one update per replica. 2) "Two-phase Vertex Program", a programming model designed for the "source-cut" partitioning with two operations: GenUpdate and ApplyUpdate. 3) Hierarchical communication and overlapping, which further improve performance by exploiting unique opportunities offered by the proposed partitioning and programming model. We evaluate GRAPHP using a cycle-accurate simulator with 5 real-world graphs and 4 algorithms. The results show that it provides, on average, a 1.7x speedup and 89% energy saving compared to TESSERACT.
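
To make the communication claim concrete, the sketch below counts messages for one iteration on a toy graph under both schemes. It is an illustration, not the paper's implementation: the function name count_messages, the cube_of mapping, and the toy graph are hypothetical, and it assumes that a source vertex gets exactly one replica in each remote cube that owns at least one of its destinations.

```python
def count_messages(edges, cube_of):
    """Return (remote_puts, replica_updates) for one iteration.

    edges   : iterable of (src, dst) vertex pairs
    cube_of : dict mapping each vertex to the cube that owns it
              (hypothetical layout, chosen for illustration)
    """
    remote_puts = 0      # edge-based scheme: one message per cross-cube edge
    replicas = set()     # replica-based scheme: one message per (src, remote cube)
    for src, dst in edges:
        if cube_of[src] != cube_of[dst]:
            remote_puts += 1
            # Assumption: all edges from src into the same remote cube
            # share a single replica of src in that cube.
            replicas.add((src, cube_of[dst]))
    return remote_puts, len(replicas)


# Toy graph: vertices 0 and 1 live in cube 0; vertices 2, 3, 4 live in cube 1.
edges = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (2, 0)]
cube_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

puts, updates = count_messages(edges, cube_of)
print(puts, updates)  # 6 cross-cube edges, 3 replicas
```

In this toy case all six edges cross cubes, yet only three replicas are needed, so the replica-based count is half the edge-based one. The gap grows with the number of edges a source has into the same remote cube.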

TOD-Tree: Task-Overlapped Direct Send Tree Image Compositing for Hybrid MPI Parallelism and GPUs

MSKD: multi-split KD-tree design on GPU

Realtime Ray Tracing on a Hybrid Parallel Architecture

Parallel Transient Stability-Constrained Optimal Power Flow Using GPU as Coprocessor

Parallel Volume Rendering on Distributed-Memory Multiprocessor System

OpenCL-Based Real-Time KD-Tree and Raytracing for Dynamic Scene

P-Cloth: Interactive Complex Cloth Simulation on Multi-GPU Systems Using Dynamic Matrix Assembly and Pipelined Implicit Integrators

Implementation of a Parallel Tree Method on a GPU

Balancing CPU and GPU: Real-Time Visualization of Large Scale 3D Scanning Models

Hybrid CPU-GPU scheduling and execution of tree traversals

Parallel Shear-Warp Factorization on THPCS

POSTER: High Performance GPU Concurrent B+tree

A Graphics Processing Unit Implementation and Optimization for Parallel Double-Difference Seismic Tomography

Exploring GPU Stream-Aware Message Passing using Triggered Operations

Efficient Pipelining Parallel Methods For Image Compositing In Sort-Last Rendering

GPU Support for Automatic Generation of Finite-Differences Stencil Kernels

Geometric Partitioning and Ordering Strategies for Task Mapping on Parallel Computers

Petascale XCT: 3D Image Reconstruction with Hierarchical Communications on Multi-GPU Nodes

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Task-based Augmented Contour Trees with Fibonacci Heaps

GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition