Abstract: Field-programmable gate arrays (FPGAs) are an ideal candidate for accelerating graph neural networks (GNNs). However, the FPGA redeployment process is time-consuming when updating or switching between diverse GNN models across different applications. Existing GNN processors eliminate the need for FPGA redeployment when switching between GNN models, but they adapt to different matrix multiplication types by switching between processing units, which decreases hardware utilization. In addition, DDR memory bandwidth limits further improvements in hardware performance. This paper proposes Graph-OPU, a highly flexible FPGA-based overlay processor for GNN acceleration. Graph-OPU provides excellent flexibility and programmability for users, as the executable code of GNN models is automatically compiled and reloaded without requiring FPGA redeployment. First, we customize the compiler and instruction sets for the inference process of different GNN models. Second, we customize the datapath and optimize the data format in the microarchitecture to fully leverage the advantages of high bandwidth memory (HBM). Third, we design a unified matrix multiplication to handle both sparse-dense matrix multiplication (SpMM) and general matrix multiplication (GEMM), enhancing Graph-OPU performance. During Graph-OPU execution, the computational units are shared between SpMM and GEMM instead of being switched, which improves hardware utilization. Finally, we implement a hardware prototype on the Xilinx Alveo U50 and evaluate mainstream GNN models on various datasets. Experimental results show that Graph-OPU achieves up to 1654\(\times\) and 63\(\times\) speedup, as well as up to 5305\(\times\) and 422\(\times\) higher energy efficiency, compared to implementations on CPU and GPU, respectively. Graph-OPU outperforms state-of-the-art (SOTA) end-to-end overlay accelerators for GNNs, reducing latency by 1.36\(\times\) and improving energy efficiency by 1.41\(\times\) on average. Moreover, Graph-OPU achieves an average 1.45\(\times\) improvement in end-to-end latency over the SOTA GNN processor. Graph-OPU represents an in-depth study of an FPGA-based overlay processor for GNNs, offering high flexibility, speedup, and energy efficiency.
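
To make the SpMM/GEMM distinction concrete, the following is a minimal software sketch, not the Graph-OPU microarchitecture. It assumes a GCN-style layer computed as aggregation (sparse adjacency times dense features, SpMM) followed by combination (dense features times dense weights, GEMM), with the adjacency stored in CSR form. All names (`mac_row`, `spmm_csr`, `gemm`, `gcn_layer`) and the toy graph are hypothetical and chosen for illustration only. Both products are routed through the same multiply-accumulate helper, which loosely mirrors the idea of sharing computational units between the two multiplication types rather than switching between separate ones.

```python
def mac_row(acc, scalar, row):
    """Shared multiply-accumulate step: acc += scalar * row (element-wise)."""
    for j, value in enumerate(row):
        acc[j] += scalar * value


def spmm_csr(indptr, indices, data, dense):
    """Sparse (CSR) x dense product; only stored non-zeros are visited."""
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(len(indptr) - 1)]
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            mac_row(out[i], data[k], dense[indices[k]])
    return out


def gemm(a, b):
    """Dense x dense product built on the same multiply-accumulate step."""
    n_cols = len(b[0])
    out = [[0.0] * n_cols for _ in a]
    for i, a_row in enumerate(a):
        for k, a_ik in enumerate(a_row):
            mac_row(out[i], a_ik, b[k])
    return out


def gcn_layer(indptr, indices, data, features, weights):
    """Aggregation (SpMM) followed by combination (GEMM); no activation."""
    aggregated = spmm_csr(indptr, indices, data, features)
    return gemm(aggregated, weights)


# Toy 3-node path graph with self-loops (unnormalized adjacency), in CSR form.
# Dense equivalent: [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
INDPTR = [0, 2, 5, 7]
INDICES = [0, 1, 0, 1, 2, 1, 2]
DATA = [1.0] * 7
FEATURES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
WEIGHTS = [[1.0, 2.0], [3.0, 4.0]]

if __name__ == "__main__":
    print(gcn_layer(INDPTR, INDICES, DATA, FEATURES, WEIGHTS))
```

In this sketch the SpMM loop touches only the seven stored non-zeros of the adjacency matrix, whereas the GEMM loop visits every entry of its left operand. That difference in access pattern is why the two products are often mapped to separate hardware units.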

TuNao: A High-Performance and Energy-Efficient Reconfigurable Accelerator for Graph Processing

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Grapher: A Reconfigurable Graph Computing Accelerator with Optimized Processing Elements

DyGA: A Hardware-Efficient Accelerator with Traffic-Aware Dynamic Scheduling for Graph Convolutional Networks

Towards High-Performance Graph Processing: From a Hardware/Software Co-Design Perspective

OmniGraph: A Scalable Hardware Accelerator for Graph Processing

HyTGraph: GPU-Accelerated Graph Processing with Hybrid Transfer Management

Domino: An Asynchronous and Energy-efficient Accelerator for Graph Processing (Abstract Only)

An optimized architecture for accelerating graph computing on FPGAs

Balancing Memory Accesses for Energy-Efficient Graph Analytics Accelerators

EnGN: A High-Throughput and Energy-Efficient Accelerator for Large Graph Neural Networks

GraphIA: An In-Situ Accelerator for Large-Scale Graph Processing

GraphR: Accelerating Graph Processing Using ReRAM

Towards Power Efficient DNN Accelerator Design on Reconfigurable Platform

Graphitron: A Domain Specific Language for FPGA-based Graph Processing Accelerator Generation

SoGraph: A State-Aware Architecture for Out-of-Memory Graph Processing on HBM-Equipped FPGAs

Accelerating Graph Analytics on a Reconfigurable Architecture with a Data-Indirect Prefetcher

Graph-OPU: A Highly Integrated FPGA-Based Overlay Processor for Graph Neural Networks

ForeGraph: Exploring Large-Scale Graph Processing on Multi-FPGA Architecture

TEGRA -- Scaling Up Terascale Graph Processing with Disaggregated Computing