Abstract: Attention Graph Neural Networks (AT-GNNs), such as GAT and Graph Transformer, have demonstrated superior performance compared to other GNNs. However, existing GNN systems struggle to efficiently train AT-GNNs on GPUs due to their intricate computation patterns. The execution of AT-GNN operations without kernel fusion results in heavy data movement and significant kernel launch overhead, while fixed thread scheduling in existing GNN kernel fusion strategies leads to sub-optimal performance, redundant computation and unbalanced workload. To address these challenges, we propose a dynamic kernel fusion framework, DF-GNN, for the AT-GNN family. DF-GNN introduces a dynamic bi-level thread scheduling strategy, enabling flexible adjustments to thread scheduling while retaining the benefits of shared memory within the fused kernel. DF-GNN tailors specific thread scheduling for operations in AT-GNNs and considers the performance bottleneck shift caused by the presence of super nodes. Additionally, DF-GNN is integrated with the PyTorch framework for high programmability. Evaluations across diverse GNN models and multiple datasets reveal that DF-GNN surpasses existing GNN kernel optimization works like cuGraph and dgNN, with speedups up to $7.0\times$ over the state-of-the-art non-fusion DGL sparse library. Moreover, it achieves an average speedup of $2.16\times$ in end-to-end training compared to the popular GNN computing framework DGL.

What problem does this paper attempt to address?

This paper addresses the challenge of efficiently training attention-based Graph Neural Networks (AT-GNNs) on GPUs. Existing GNN systems face two main problems when training AT-GNNs:

1. **Complex Computation Patterns**: AT-GNN computation typically has three main steps: calculating attention scores on edges, normalizing these scores, and aggregating neighbor features. Executed without kernel fusion, these operations incur heavy data movement and significant kernel launch overhead.
2. **Fixed Thread Scheduling Policies**: Existing GNN kernel fusion strategies adopt fixed thread scheduling, which leads to sub-optimal performance, redundant computation, and workload imbalance. In particular, when a graph contains super-nodes (nodes with a large number of neighbors), a fixed scheduling policy cannot adapt to the differing computational requirements, which limits performance.

To address these challenges, the paper proposes a dynamic kernel fusion framework, DF-GNN. Its main contributions are:

- **Dynamic Bi-level Thread Scheduling Policy**: DF-GNN introduces a dynamic bi-level thread scheduling policy that lets each operation adjust its thread scheduling both between and within blocks while retaining the benefits of shared memory. This allows the fused kernel to adapt to the computational requirements of the different operations in AT-GNNs.
- **Optimization for Super-nodes**: DF-GNN accounts for the performance bottleneck shift caused by super-nodes and designs two general kernel fusion methods (SMMF and PMF), selecting the appropriate one according to the characteristics of the input graph.
- **Integration with the PyTorch Framework**: DF-GNN is integrated with the PyTorch framework and provides an easy-to-use API, so users can call DF-GNN's optimized kernels from PyTorch models to improve the efficiency of model training and inference.

In evaluations across multiple AT-GNN models and diverse datasets, DF-GNN outperforms existing GNN kernel optimization works such as cuGraph and dgNN, with kernel speedups of up to $7.0\times$ over the non-fusion DGL sparse library and an average end-to-end training speedup of $2.16\times$ over DGL.
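
The three-step computation pattern described above (score each edge, normalize per node, aggregate neighbor features) can be sketched in plain Python as a reference. This is a minimal illustration, not DF-GNN's kernel or API: the function name `attention_aggregate`, the CSR inputs `indptr` and `indices`, and the dot-product score between `q` and `k` are all assumptions chosen for the example. The per-edge lists `scores` and `weights` stand in for the intermediate results that an unfused execution would pass between separate kernels.

```python
import math


def attention_aggregate(indptr, indices, q, k, v):
    """Reference (unfused, CPU) attention aggregation over a CSR graph.

    indptr, indices: CSR adjacency; row i lists the neighbors of node i.
    q, k, v: per-node feature vectors as lists of lists of floats.
    The dot-product score is an assumption made for this sketch.
    """
    out_dim = len(v[0])
    out = []
    for i in range(len(indptr) - 1):
        nbrs = indices[indptr[i]:indptr[i + 1]]
        if not nbrs:
            # Isolated node: nothing to aggregate.
            out.append([0.0] * out_dim)
            continue
        # Step 1: one attention score per edge (i, j).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) for j in nbrs]
        # Step 2: softmax over node i's edges (shifted by the max for stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Step 3: weighted sum of neighbor features.
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, nbrs))
            for d in range(out_dim)
        ])
    return out


# Tiny example: node 0 has neighbors 1 and 2, node 1 has neighbor 0,
# node 2 has no neighbors. Zero queries give uniform attention weights.
indptr = [0, 2, 3, 3]
indices = [1, 2, 0]
q = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention_aggregate(indptr, indices, q, k, v)
```

In this sketch the work per node grows with its neighbor count, so a node with many neighbors dominates the loop. That is the kind of imbalance the summary attributes to super-nodes under fixed scheduling.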

DF-GNN: Dynamic Fusion Framework for Attention Graph Neural Networks on GPUs

fuseGNN: Accelerating Graph Convolutional Neural Network Training on GPGPU

FP-GNN: Adaptive FPGA Accelerator for Graph Neural Networks

AdaptGear: Accelerating GNN Training Via Adaptive Subgraph-Level Kernels on GPUs

GNNFlow: A Distributed Framework for Continuous Temporal GNN Learning on Dynamic Graphs

TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs

DyGA: A Hardware-Efficient Accelerator with Traffic-Aware Dynamic Scheduling for Graph Convolutional Networks

GRAF: Graph Attention-aware Fusion Networks

FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale

DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion

DynaGraph: dynamic graph neural networks at scale

Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks

Towards Scalable GPU-Accelerated SNN Training via Temporal Fusion

MaxK-GNN: Towards Theoretical Speed Limits for Accelerating Graph Neural Networks Training

GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs

DeltaGNN: Accelerating Graph Neural Networks on Dynamic Graphs with Delta Updating

Hardware Acceleration for GCNs Via Bidirectional Fusion

EnGN: A High-Throughput and Energy-Efficient Accelerator for Large Graph Neural Networks

DynaHB: A Communication-Avoiding Asynchronous Distributed Framework with Hybrid Batches for Dynamic GNN Training

DRGN: a dynamically reconfigurable accelerator for graph neural networks

GLP4NN: A Convergence-invariant and Network-agnostic Light-weight Parallelization Framework for Deep Neural Networks on Modern GPUs