Abstract: Training a Graph Neural Network (GNN) model on large-scale graphs involves a high volume of data communication and computation. While state-of-the-art CPUs and GPUs feature high computing power, the standard GNN training protocol adopted in existing GNN frameworks cannot efficiently utilize the platform resources. To this end, we propose a novel Unified CPU-GPU protocol that improves the resource utilization of GNN training on a CPU-GPU platform. The Unified CPU-GPU protocol instantiates multiple GNN training processes in parallel on both the CPU and the GPU. By allocating training processes on the CPU to perform GNN training collaboratively with the GPU, the proposed protocol improves platform resource utilization and reduces the CPU-GPU data transfer overhead. Because CPU and GPU performance differ, we develop a novel load balancer that dynamically balances the workload between CPUs and GPUs at runtime. We evaluate our protocol using two representative GNN sampling algorithms, with two widely used GNN models, on three datasets. Compared with the standard training protocol adopted in state-of-the-art GNN frameworks, our protocol improves resource utilization and reduces overall training time. On a platform where the GPU moderately outperforms the CPU, our protocol speeds up GNN training by up to 1.41x. On a platform where the GPU significantly outperforms the CPU, our protocol speeds up GNN training by up to 1.26x. Our protocol is open-sourced and can be seamlessly integrated into state-of-the-art GNN frameworks to accelerate GNN training. It particularly benefits users with limited GPU access, given the high demand for GPUs.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses low resource utilization and high data transfer overhead when training Graph Neural Network (GNN) models on large-scale graphs. Although modern CPUs and GPUs offer high computing power and memory bandwidth, the standard GNN training protocol adopted by existing frameworks (such as PyTorch-Geometric and Deep Graph Library) cannot use these platform resources efficiently, so large-scale GNN training can take hours or even days.

Specifically, the paper points out the following deficiencies in the standard GNN training protocol:

1. **Unbalanced task allocation**: Most of the workload is offloaded to the GPU, while the CPU sits idle most of the time. This leaves the CPU's computing power unused, even though, according to the paper, modern CPUs can deliver GNN training performance comparable to that of GPUs.
2. **Coarse-grained task scheduling**: Existing protocols alternate between memory-intensive and compute-intensive operations. As a result, memory bandwidth is under-utilized during compute-intensive tasks, and computing cores are under-utilized during memory-intensive tasks.

To solve these problems, the authors propose a new Unified CPU-GPU training protocol that distributes GNN training tasks across both the CPU and the GPU, which improves platform resource utilization and reduces CPU-GPU data transfer overhead. The authors also develop a Dynamic Load Balancer that keeps the workload balanced between the CPU and the GPU at runtime.

### Main contributions

- **Detailed analysis**: Analyze existing GNN frameworks in detail and identify inefficiencies in the standard GNN training protocol.
- **New protocol**: Propose a new Unified CPU-GPU training protocol that improves the utilization of computing resources and memory bandwidth.
- **Dynamic load balancer**: Develop a new Dynamic Load Balancer that balances the workload between the CPU and the GPU at runtime.
- **Experimental evaluation**: Evaluate the protocol on different CPU-GPU platforms. On a platform where the GPU moderately outperforms the CPU, the new protocol speeds up GNN training by up to 1.41x; on a platform where the GPU significantly outperforms the CPU, it speeds up GNN training by up to 1.26x.
- **Open-source implementation**: The new protocol is open-sourced and can be integrated into existing GNN frameworks, such as PyG and DGL, to improve GNN training performance.

### Formula presentation

To clarify the training process of GNN models, the formulas of two representative GNN models are presented here:

1. **GCN (Graph Convolutional Network)**:

   \[ H^{(l)} = \sigma(\hat{A} H^{(l-1)} W^{(l)}) \]

   where:
   - \(H^{(l)}\) is the feature matrix of the \(l\)-th layer.
   - \(W^{(l)}\) is the weight matrix of the \(l\)-th layer.
   - \(\hat{A}\) is the normalized adjacency matrix.
   - \(\sigma(\cdot)\) is the activation function.

2. **GraphSAGE**:

   \[ H^{(l)} = \sigma(H^{(l-1)} W_1 + \hat{A} H^{(l-1)} W_2) \]

   where:
   - \(H^{(l)}\) is the feature matrix of the \(l\)-th layer.
   - \(W_1\) and \(W_2\) are two different weight matrices.
   - \(\hat{A}\) is the normalized adjacency matrix.
   - \(\sigma(\cdot)\) is the activation function.

In both formulas, each layer aggregates the features of a node's neighbors (the product with \(\hat{A}\)) and then transforms them with learned weights; stacking \(l\) layers therefore gives each node a representation that incorporates information from neighbors up to \(l\) hops away.
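The two layer formulas above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`normalize_adjacency`, `gcn_layer`, `sage_layer`) are hypothetical, ReLU is assumed for \(\sigma\), and \(\hat{A}\) is assumed to be the symmetric normalization with self-loops, \(D^{-1/2}(A + I)D^{-1/2}\), since the text only says it is "the normalized adjacency matrix".

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Assumed normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # degrees are >= 1 due to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def relu(x: np.ndarray) -> np.ndarray:
    """Assumed activation function sigma."""
    return np.maximum(x, 0.0)


def gcn_layer(a_hat: np.ndarray, h_prev: np.ndarray, w: np.ndarray) -> np.ndarray:
    """GCN: H^(l) = sigma(A_hat H^(l-1) W^(l))."""
    return relu(a_hat @ h_prev @ w)


def sage_layer(a_hat: np.ndarray, h_prev: np.ndarray,
               w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """GraphSAGE: H^(l) = sigma(H^(l-1) W_1 + A_hat H^(l-1) W_2)."""
    return relu(h_prev @ w1 + a_hat @ h_prev @ w2)


# Toy example: a 3-node path graph 0 - 1 - 2 with 2 input features per node.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
a_hat = normalize_adjacency(adj)

rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 2))    # input features H^(0)
w = rng.normal(size=(2, 4))     # GCN weights W^(1)
w1 = rng.normal(size=(2, 4))    # GraphSAGE self weights W_1
w2 = rng.normal(size=(2, 4))    # GraphSAGE neighbor weights W_2

out_gcn = gcn_layer(a_hat, h0, w)
out_sage = sage_layer(a_hat, h0, w1, w2)
print(out_gcn.shape, out_sage.shape)
```

The only structural difference between the two layers is that GraphSAGE keeps a separate weight matrix for a node's own features, while GCN folds the node into its neighborhood through \(\hat{A}\). In mini-batch training, the dense product with \(\hat{A}\) would typically be replaced by a sparse aggregation over a sampled subgraph.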

A Unified CPU-GPU Protocol for GNN Training

HitGNN: High-throughput GNN Training Framework on CPU+Multi-FPGA Heterogeneous Platform

NeutronOrch: Rethinking Sample-Based GNN Training under CPU-GPU Heterogeneous Environments

HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture

HP-GNN: Generating High Throughput GNN Training Implementation on CPU-FPGA Heterogeneous Platform

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

BatchGNN: Efficient CPU-Based Distributed GNN Training on Very Large Graphs

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism

BGS: Accelerate GNN Training on Multiple GPUs

Accelerating GNN Training by Adapting Large Graphs to Distributed Heterogeneous Architectures

GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism

HongTu: Scalable Full-Graph GNN Training on Multiple GPUs (via communication-optimized CPU data offloading)

Graph Neural Network Training with Data Tiering

GNNLab: a Factored System for Sample-Based GNN Training over GPUs

GraphTensor: Comprehensive GNN-Acceleration Framework for Efficient Parallel Processing of Massive Datasets

CoGNN: An Algorithm-Hardware Co-Design Approach to Accelerate GNN Inference With Minibatch Sampling