DF-GNN: Dynamic Fusion Framework for Attention Graph Neural Networks on GPUs

Jiahui Liu, Zhenkun Cai, Zhiyong Chen, Minjie Wang
2024-11-25
Abstract: Attention Graph Neural Networks (AT-GNNs), such as GAT and Graph Transformer, have demonstrated superior performance compared to other GNNs. However, existing GNN systems struggle to efficiently train AT-GNNs on GPUs due to their intricate computation patterns. The execution of AT-GNN operations without kernel fusion results in heavy data movement and significant kernel launch overhead, while fixed thread scheduling in existing GNN kernel fusion strategies leads to sub-optimal performance, redundant computation and unbalanced workload. To address these challenges, we propose a dynamic kernel fusion framework, DF-GNN, for the AT-GNN family. DF-GNN introduces a dynamic bi-level thread scheduling strategy, enabling flexible adjustments to thread scheduling while retaining the benefits of shared memory within the fused kernel. DF-GNN tailors specific thread scheduling for operations in AT-GNNs and considers the performance bottleneck shift caused by the presence of super nodes. Additionally, DF-GNN is integrated with the PyTorch framework for high programmability. Evaluations across diverse GNN models and multiple datasets reveal that DF-GNN surpasses existing GNN kernel optimization works like cuGraph and dgNN, with speedups up to $7.0\times$ over the state-of-the-art non-fusion DGL sparse library. Moreover, it achieves an average speedup of $2.16\times$ in end-to-end training compared to the popular GNN computing framework DGL.
Machine Learning, Performance
What problem does this paper attempt to address?
This paper addresses the challenge of efficiently training attention-based Graph Neural Networks (AT-GNNs) on GPUs. Existing GNN systems face two main problems when training AT-GNNs:

1. **Complex computation patterns**: An AT-GNN layer usually involves three main steps: calculating attention scores on edges, normalizing these scores, and aggregating neighbor features. Executed as separate kernels without fusion, these operations incur heavy data movement and significant kernel launch overhead.
2. **Fixed thread scheduling policies**: Existing GNN kernel fusion strategies use fixed thread scheduling, which leads to sub-optimal performance, redundant computation, and workload imbalance. In particular, a fixed schedule cannot adapt to the different computational requirements that arise with super nodes (nodes with a large number of neighbors), which limits the achievable speedup.

To address these challenges, the paper proposes a dynamic kernel fusion framework, DF-GNN. Its main contributions are:

- **Dynamic bi-level thread scheduling**: DF-GNN introduces a dynamic bi-level thread scheduling strategy that lets each operation adjust its thread scheduling both across and within thread blocks while retaining the benefits of shared memory inside the fused kernel. This allows the schedule to match the computational requirements of each operation in an AT-GNN.
- **Optimization for super nodes**: DF-GNN accounts for the performance bottleneck shift caused by super nodes and designs two general kernel fusion methods (SMMF and PMF), selecting the appropriate one according to the characteristics of the input graph.
- **Integration with the PyTorch framework**: DF-GNN is integrated with PyTorch and provides an easy-to-use API, so users can call DF-GNN's optimized kernels from PyTorch models to improve the efficiency of training and inference.

In evaluations across multiple AT-GNN models and diverse datasets, DF-GNN outperforms existing GNN kernel optimization works such as cuGraph and dgNN, with kernel speedups of up to 7.0 times over the non-fusion DGL sparse library and an average speedup of 2.16 times in end-to-end training compared to DGL.
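The three-step computation pattern described above (edge scores, normalization, neighbor aggregation) can be illustrated with a small unfused reference in plain Python. This is only a sketch for intuition, not DF-GNN's kernel code: the function name `attention_aggregate`, the CSR inputs `indptr`/`indices`, and the dot-product score function are assumptions chosen for illustration (GAT, for example, uses a different score function). Each stage materializes a full per-edge array before the next stage starts, which is the kind of intermediate data movement that kernel fusion aims to avoid.

```python
import math

def attention_aggregate(indptr, indices, q, k, v):
    """Unfused three-stage attention over a graph stored in CSR form.

    indptr/indices list, for each destination node, its source neighbors.
    q, k, v are per-node feature lists. Dot-product scoring is an
    assumption made for this sketch.
    """
    num_nodes = len(indptr) - 1
    dim = len(v[0])

    # Stage 1: one raw attention score per edge (dst <- src).
    scores = []
    for dst in range(num_nodes):
        for e in range(indptr[dst], indptr[dst + 1]):
            src = indices[e]
            scores.append(sum(a * b for a, b in zip(q[dst], k[src])))

    # Stage 2: softmax over the incoming edges of each destination node.
    weights = [0.0] * len(scores)
    for dst in range(num_nodes):
        start, end = indptr[dst], indptr[dst + 1]
        if start == end:
            continue  # isolated node: nothing to normalize
        row_max = max(scores[start:end])  # subtract max for stability
        exps = [math.exp(s - row_max) for s in scores[start:end]]
        total = sum(exps)
        for i, x in enumerate(exps):
            weights[start + i] = x / total

    # Stage 3: weighted sum of neighbor features.
    out = [[0.0] * dim for _ in range(num_nodes)]
    for dst in range(num_nodes):
        for e in range(indptr[dst], indptr[dst + 1]):
            src = indices[e]
            for d in range(dim):
                out[dst][d] += weights[e] * v[src][d]
    return scores, weights, out

# Tiny example graph: node 0 <- {1, 2}, node 1 <- {0}, node 2 has no in-edges.
indptr = [0, 2, 3, 3]
indices = [1, 2, 0]
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
v = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
scores, weights, out = attention_aggregate(indptr, indices, q, k, v)
print(weights, out)
```

In this reference, the inner loop length of every stage equals the node's in-degree, so a node with very many neighbors dominates the running time. On a GPU, the analogous effect is the workload imbalance that the summary attributes to super nodes under a fixed thread schedule.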