Abstract: Graph neural networks (GNNs) have recently empowered various novel computer vision (CV) tasks. GNN-based CV tasks employ either a combination of CNN layers and GNN layers or GNN layers alone. This paper introduces GCV-Turbo, a domain-specific accelerator on FPGA for end-to-end acceleration of GNN-based CV tasks. GCV-Turbo consists of two key components: (1) a novel hardware architecture optimized for the computation kernels in both CNNs and GNNs using the same set of computation resources, and (2) a PyTorch-compatible compiler that takes a user-defined model as input, performs end-to-end optimization for the computation graph of a given GNN-based CV task, and produces optimized code for hardware execution. The hardware architecture and the compiler work synergistically to support a variety of GNN-based CV tasks. We implement GCV-Turbo on a state-of-the-art FPGA and evaluate its performance across six representative GNN-based CV tasks with diverse input data modalities (e.g., image, human skeleton, point cloud). Compared with state-of-the-art CPU (GPU) implementations, GCV-Turbo achieves an average latency reduction of $68.4\times$ ($4.1\times$) on these six GNN-based CV tasks. Moreover, GCV-Turbo supports the execution of standalone CNNs or GNNs, achieving performance comparable to that of state-of-the-art CNN (GNN) accelerators for widely used CNN-only (GNN-only) models.
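To make the idea of serving CNN and GNN kernels with "the same set of computation resources" concrete, the following is a minimal software sketch, not a description of GCV-Turbo's actual hardware. It assumes the common observation that a convolution (via an im2col rearrangement) and a GCN-style graph convolution (A · X · W) can both be reduced to dense matrix multiplication. All function names here are hypothetical and chosen for illustration only.

```python
def matmul(a, b):
    """Dense matrix product of two row-major lists of lists (the shared kernel)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def conv2d_as_matmul(image, kernel):
    """Valid 2D cross-correlation expressed as im2col followed by matmul."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # im2col: one row per output pixel, holding the flattened input patch.
    patches = [[image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
               for i in range(out_h) for j in range(out_w)]
    # The flattened kernel becomes a single-column weight matrix.
    weights = [[kernel[di][dj]] for di in range(kh) for dj in range(kw)]
    flat = matmul(patches, weights)
    # Fold the flat result back into an out_h x out_w feature map.
    return [[flat[i * out_w + j][0] for j in range(out_w)] for i in range(out_h)]


def gcn_layer(adjacency, features, weights):
    """Graph convolution A @ X @ W: aggregate neighbours, then transform."""
    return matmul(matmul(adjacency, features), weights)


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(conv2d_as_matmul(image, kernel))

    adjacency = [[1, 1], [1, 1]]
    features = [[1, 2], [3, 4]]
    identity = [[1, 0], [0, 1]]
    print(gcn_layer(adjacency, features, identity))
```

Both layer types end up calling the same `matmul` routine, which is the software analogue of sharing one compute array between CNN and GNN layers. Real GNN adjacency matrices are usually sparse, so a practical design would not treat them as dense as this sketch does.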

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses end-to-end acceleration of graph neural networks (GNNs) in computer vision (CV) tasks. Specifically, it proposes **GCV-Turbo**, a domain-specific accelerator on FPGA for CV tasks that either combine convolutional neural network (CNN) layers with GNN layers or use GNN layers alone.

### Background and challenges

1. **Applications of GNNs in CV tasks**:
   - **Label-efficient image classification**: GNNs can reduce the number of labeled images required for training and are suitable for few-shot learning.
   - **Processing of non-Euclidean data structures**: GNNs naturally handle non-Euclidean data structures such as point clouds and 3D meshes.
   - **Relational learning**: GNNs capture relationships between complex objects through the message-passing mechanism, which suits tasks such as video action recognition.
2. **Deficiencies of existing accelerators**:
   - **Low hardware utilization**: Existing CNN accelerators and GNN accelerators are each optimized for a single type of layer and cannot efficiently support GNN-based CV tasks.
   - **High cost of data layout conversion**: CNN layers and GNN layers use different data layouts, and frequent conversions between them incur significant performance costs.
   - **General-purpose processors are not suitable**: Because of their complex cache hierarchies, CPUs and GPUs struggle to achieve low-latency inference and are a poor fit for real-time applications.

### Solutions

1. **GCV-Turbo architecture design**:
   - **Resource-sharing strategy**: The computation kernels of CNN and GNN layers share the same set of computation resources, which improves resource utilization.
   - **Flexible data paths and memory organization**: The data paths and memory organization are designed to support efficient execution of multiple computation kernels.
2. **Compiler design**:
   - **End-to-end optimization**: The compiler does not only optimize individual CNN or GNN layers; it optimizes the entire computation graph end to end, reducing inference latency.
   - **Specific optimization techniques**: These include data manipulation layer generation, layer fusion, data-layout-centric mapping, and sparsity-aware computation primitive mapping.

### Main contributions

1. **Proposing GCV-Turbo**: The first end-to-end accelerator for GNN-based CV tasks.
2. **Novel hardware architecture**: A flexible data path and memory organization support the efficient execution of multiple computation kernels.
3. **Custom compiler**: A custom compiler performs end-to-end optimization and reduces inference latency.
4. **Performance evaluation**: On six representative GNN-based CV tasks, GCV-Turbo reduces latency by an average of 68.4× compared with state-of-the-art CPU implementations and 4.1× compared with state-of-the-art GPU implementations.
5. **Compatibility**: When executing models that contain only CNNs or only GNNs, GCV-Turbo performs comparably to existing CNN and GNN accelerators.

### Conclusion

Through its hardware architecture and compiler design, GCV-Turbo provides end-to-end acceleration for GNN-based CV tasks and offers a high-performance option for real-time applications.
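The summary names sparsity-aware mapping of computation primitives as one of the compiler's techniques but gives no details. As a rough illustration of the general idea only, the sketch below picks between a dense and a zero-skipping matrix multiply based on the density of the left operand. The function names and the `threshold=0.5` cutoff are assumptions made for this example and are not taken from the paper.

```python
def density(matrix):
    """Fraction of non-zero entries in a row-major list of lists."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for v in row if v != 0)
    return nonzero / total if total else 0.0


def dense_matmul(a, b):
    """Plain dense product: touches every element of a."""
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]
            for row in a]


def sparse_matmul(a, b):
    """Zero-skipping product: only non-zero entries of a contribute work."""
    cols = len(b[0])
    out = [[0] * cols for _ in a]
    for i, row in enumerate(a):
        for k, v in enumerate(row):
            if v != 0:
                for j in range(cols):
                    out[i][j] += v * b[k][j]
    return out


def select_primitive(matrix, threshold=0.5):
    """Hypothetical mapping rule: use the sparse primitive below the cutoff."""
    return "sparse" if density(matrix) < threshold else "dense"


def mapped_matmul(a, b, threshold=0.5):
    """Return the chosen primitive name together with the product a @ b."""
    kind = select_primitive(a, threshold)
    kernel = sparse_matmul if kind == "sparse" else dense_matmul
    return kind, kernel(a, b)


if __name__ == "__main__":
    adjacency = [[0, 0, 1], [0, 0, 0], [2, 0, 0]]   # mostly zeros
    features = [[1, 2], [3, 4], [5, 6]]
    print(mapped_matmul(adjacency, features))
```

Both primitives compute the same product; the choice only affects how much work is done. A real compiler would presumably base this decision on a hardware cost model rather than on a fixed cutoff.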

GCV-Turbo: End-to-end Acceleration of GNN-based Computer Vision Tasks on FPGA
