A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs.

Qinggang Wang,Long Zheng,Jieshan Zhao,Xiaofei Liao,Hai Jin,Jingling Xue
DOI: https://doi.org/10.1145/3390523
IF: 1.444
2020-01-01
ACM Transactions on Architecture and Code Optimization
Abstract:FPGA-based graph processing accelerators are nowadays equipped with multiple pipelines for hardware acceleration of graph computations. However, their multi-pipeline efficiency can suffer greatly from the considerable overheads caused by the read/write conflicts in their on-chip BRAM from different pipelines, leading to significant performance degradation and poor scalability. In this article, we investigate the underlying causes behind such inter-pipeline read/write conflicts by focusing on multi-pipeline FPGAs for accelerating Sparse Matrix Vector Multiplication (SpMV) arising in graph processing. We exploit our key insight that the problem of eliminating inter-pipeline read/write conflicts for SpMV can be formulated as one of solving a row- and column-wise tiling problem for its associated adjacency matrix. However, how to partition a sparse adjacency matrix obtained from any graph with respect to a set of pipelines by both eliminating all the inter-pipeline read/write conflicts and keeping all the pipelines reasonably load-balanced is challenging. We present a conflict-free scheduler, WaveScheduler, that can dispatch different sub-matrix tiles to different pipelines without any read/write conflict. We also introduce two optimizations that are specifically tailored for graph processing, “degree-aware vertex index renaming” for improving load balancing and “data re-organization” for enabling sequential off-chip memory access, for all the pipelines. Our evaluation on Xilinx®Alveo™ U250 accelerator card with 16 pipelines shows that WaveScheduler can achieve up to 3.57 GTEPS, running much faster than native scheduling and two state-of-the-art FPGA-based graph accelerators (by 6.48× for “native,” 2.54× for HEGP, and 2.11× for ForeGraph), on average. In particular, these performance gains also scale up significantly as the number of pipelines increases.
What problem does this paper attempt to address?