Accelerating Large-Scale Graph Analytics with FPGA and HMC

Soroosh Khoram,Jialiang Zhang,Maxwell Strange,Jing Li
DOI: https://doi.org/10.1109/FCCM.2017.58
2017-01-01
Abstract:Graph analytics that explores the relationship among interconnected entities is becoming increasingly important due to its broad applicability from machine learning to social science. However, one major challenge for graph processing systems is the irregular data access pattern of graph computation which can significantly degrade the performance. The algorithms, software, and hardware that have been tailored for mainstream parallel applications are, as a result, generally not effective for massive-scale sparse graphs from the real world due to their complexity and irregularity. To address the performance issues in large-scale graph analytics, we combine the emerging Hybrid Memory Cube (HMC) with a modern FPGA in order to achieve exceptional random access performance without any loss of flexibility or efficiency in computation. In particular, we develop collaborative software/hardware techniques to perform a level-synchronized breadth first search (BFS) on the FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that fully exploits the platform's capability to improve data locality and memory access efficiency. For each input graph, this algorithm provides an efficient data layout that allows the FPGA to coalesce memory requests into the largest possible HMC payload requests so that the number of memory requests, which is the primary factor in runtime, can be minimized. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by adding a merging unit. The merging unit takes the best advantage of the increased data locality resulting from graph clustering. We evaluated the performance of our BFS implementation using the AC-510 development kit from Micron over a set of benchmarks from a wide range of applications. We observed that the combination of the clustering algorithm and the merging hardware achieved 2.8 × average performance improvement compared to the latest FPGA-HMC based graph processing system.
What problem does this paper attempt to address?