Abstract: GPUs have become the mainstream platform for accelerating deep neural network training. On GNNs, however, GPUs face substantial challenges such as workload imbalance and irregular memory access, which leave the hardware underutilized. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor partially address these challenges, but memory traffic remains significant. We argue that drastic performance improvements can only be achieved through vertical optimization across algorithm and system innovations, rather than by treating speedup optimization as an "afterthought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, a high-performance GPU training system that integrates algorithm and system innovation. (i) We introduce the MaxK nonlinearity, provide a theoretical analysis of it as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and index of the feature matrix after the nonlinearity; (ii) we design a coalescing-enhanced forward computation with a row-wise product-based SpGEMM kernel that uses CBSR for input feature matrix fetching and strategically places a sparse output accumulation buffer in shared memory; (iii) we develop an optimized backward computation with an outer product-based SSpMM kernel. We conduct extensive evaluations of MaxK-GNN and report the end-to-end system run-time. Experiments show that the MaxK-GNN system can approach the theoretical speedup limit given by Amdahl's law. We achieve accuracy comparable to SOTA GNNs at a significantly higher speed: 3.22/4.24 times speedup (vs. theoretical limits of 5.52/7.27 times) on Reddit compared to the DGL and GNNAdvisor implementations, respectively.
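To make the MaxK nonlinearity and the CBSR format more concrete, here is a minimal pure-Python sketch. It rests on two assumptions that the abstract does not spell out: that MaxK keeps the k largest entries in each row of the feature matrix and zeroes the rest, and that CBSR stores exactly k values and k column indices per row, which is what would make the layout balanced. The function names `maxk_row`, `to_cbsr`, and `from_cbsr` are hypothetical and chosen for illustration only; this is not the paper's GPU implementation.

```python
def top_k_columns(row, k):
    """Return the column indices of the k largest entries, in ascending column order."""
    # Sorting by negated value is stable, so ties go to the lower column index.
    largest = sorted(range(len(row)), key=lambda i: -row[i])[:k]
    return sorted(largest)


def maxk_row(row, k):
    """Assumed MaxK behavior: keep the k largest entries of a row, zero the others."""
    keep = set(top_k_columns(row, k))
    return [v if i in keep else 0.0 for i, v in enumerate(row)]


def to_cbsr(dense, k):
    """Pack each row into k values plus k column indices (assumed CBSR layout)."""
    data, index = [], []
    for row in dense:
        cols = top_k_columns(row, k)
        index.append(cols)
        data.append([row[c] for c in cols])
    return data, index


def from_cbsr(data, index, width):
    """Expand the packed (data, index) pair back into a dense matrix."""
    dense = []
    for values, cols in zip(data, index):
        row = [0.0] * width
        for v, c in zip(values, cols):
            row[c] = v
        dense.append(row)
    return dense


if __name__ == "__main__":
    features = [[0.5, -1.0, 2.0, 0.1],
                [3.0, 0.2, -0.4, 1.5]]
    data, index = to_cbsr(features, k=2)
    print(data)   # values kept per row
    print(index)  # their column positions
```

Under these assumptions every row carries the same number of stored entries, so the packed form needs no row-pointer array and each row represents the same amount of work, which fits the workload-imbalance concern raised in the abstract.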

Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training

LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme

BGS: Accelerate GNN Training on Multiple GPUs

GNNear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing

XGNN: Boosting Multi-GPU GNN Training Via Global GNN Memory Store

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance

Accelerating GNN Training by Adapting Large Graphs to Distributed Heterogeneous Architectures

GNNLab: a Factored System for Sample-Based GNN Training over GPUs

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Auto-Divide GNN: Accelerating GNN Training with Subgraph Division

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Scalable and Efficient Full-Graph GNN Training for Large Graphs

BatchGNN: Efficient CPU-Based Distributed GNN Training on Very Large Graphs

HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture

FlashGNN: an In-SSD Accelerator for GNN Training

FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale

MaxK-GNN: Towards Theoretical Speed Limits for Accelerating Graph Neural Networks Training

DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU