Abstract: Field-programmable gate arrays (FPGAs) are an ideal candidate for accelerating graph neural networks (GNNs). However, the FPGA redeployment process is time-consuming when updating or switching between diverse GNN models across different applications. Existing GNN processors eliminate the need for FPGA redeployment when switching between GNN models, but they adapt to different matrix multiplication types by switching processing units, which decreases hardware utilization. In addition, DDR memory bandwidth limits further improvements in hardware performance. This paper proposes Graph-OPU, a highly flexible FPGA-based overlay processor for GNN acceleration. Graph-OPU provides excellent flexibility and programmability for users, as the executable code of GNN models is automatically compiled and reloaded without requiring FPGA redeployment. First, we customize the compiler and instruction sets for the inference process of different GNN models. Second, we customize the datapath and optimize the data format in the microarchitecture to fully leverage the advantages of high bandwidth memory (HBM). Third, we design a unified matrix multiplication to handle both sparse-dense matrix multiplication (SpMM) and general matrix multiplication (GEMM), enhancing Graph-OPU performance. During Graph-OPU execution, the computational units are shared between SpMM and GEMM instead of being switched, which improves hardware utilization. Finally, we implement a hardware prototype on the Xilinx Alveo U50 and evaluate mainstream GNN models on various datasets. Experimental results show that Graph-OPU achieves up to 1654\(\times\) and 63\(\times\) speedup, as well as up to 5305\(\times\) and 422\(\times\) higher energy efficiency, compared to implementations on CPU and GPU, respectively. Graph-OPU outperforms state-of-the-art (SOTA) end-to-end overlay accelerators for GNNs, reducing latency by 1.36\(\times\) and improving energy efficiency by 1.41\(\times\) on average. Moreover, Graph-OPU achieves an average 1.45\(\times\) improvement in end-to-end latency over the SOTA GNN processor. Graph-OPU represents an in-depth study of an FPGA-based overlay processor for GNNs, offering high flexibility, speedup, and energy efficiency.
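
The idea of sharing computational units between SpMM and GEMM can be illustrated in software. The following is a minimal sketch, not Graph-OPU's actual datapath or instruction set: it assumes both operand types are turned into a stream of (row, column, value) entries and fed to one multiply-accumulate loop. The names `sparse_stream`, `dense_stream`, and `unified_matmul` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sparse_stream(a):
    """Yield (row, col, value) for nonzero entries only (SpMM-style input)."""
    rows, cols = np.nonzero(a)
    for r, c in zip(rows, cols):
        yield int(r), int(c), float(a[r, c])


def dense_stream(a):
    """Yield (row, col, value) for every entry (GEMM-style input)."""
    for r in range(a.shape[0]):
        for c in range(a.shape[1]):
            yield r, c, float(a[r, c])


def unified_matmul(stream, n_rows, b):
    """One multiply-accumulate loop shared by both operand types.

    Each left-operand entry (r, c, v) scales row c of the dense right
    operand b and accumulates the result into output row r.
    """
    out = np.zeros((n_rows, b.shape[1]))
    for r, c, v in stream:
        out[r] += v * b[c]
    return out


# Toy GNN layer: aggregation (SpMM) followed by combination (GEMM).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 0, 0]], dtype=float)  # sparse adjacency
x = np.arange(6, dtype=float).reshape(3, 2)                     # node features
w = np.array([[1.0, 2.0], [3.0, 4.0]])                          # layer weights

agg = unified_matmul(sparse_stream(adj), adj.shape[0], x)  # A @ X
out = unified_matmul(dense_stream(agg), agg.shape[0], w)   # (A @ X) @ W
```

In this sketch the same accumulation loop serves both steps, and only the way the left operand is streamed differs. A hardware design would additionally have to handle scheduling, data formats, and memory access, none of which is modeled here.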

Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

HAO: Hardware-aware neural Architecture Optimization for Efficient Inference

Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology

E3NE: An End-to-End Framework for Accelerating Spiking Neural Networks with Emerging Neural Encoding on FPGAs

Optimizing Neural Network Inference in Edge Robotics by Harnessing FPGA Hardware Acceleration

Efficient Implementation of a Multi-Layer Gradient-Free Online-Trainable Spiking Neural Network on FPGA

WPU: A FPGA-based Scalable, Efficient and Software/Hardware Co-design Deep Neural Network Inference Acceleration Processor

Dynasparse: Accelerating GNN Inference through Dynamic Sparsity Exploitation

A Power Efficient Neural Network Implementation on Heterogeneous FPGA and GPU Devices

Invited: Algorithm-Software-Hardware Co-Design for Deep Learning Acceleration

A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA

Towards a Uniform Architecture for the Efficient Implementation of 2D and 3D Deconvolutional Neural Networks on FPGAs

A neural network accelerated optimization method for FPGA

A Fine-Grained Sparse Accelerator for Multi-Precision DNN

Graph-OPU: A Highly Integrated FPGA-Based Overlay Processor for Graph Neural Networks

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform

ONNX-to-Hardware Design Flow for Adaptive Neural-Network Inference on FPGAs

[DL] A Survey of FPGA-based Neural Network Inference Accelerators

Design of Network-on-Chip-Based Restricted Coulomb Energy Neural Network Accelerator on FPGA Device