Abstract: The GPU has become the mainstream platform for accelerating deep neural network training. On GNNs, however, GPUs face substantial challenges such as workload imbalance and irregular memory access, which leave the hardware underutilized. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor partially address these challenges, but memory traffic remains significant. We argue that drastic performance improvements can only be achieved by vertically co-optimizing algorithm and system, rather than treating speedup optimization as an "after-thought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, a high-performance GPU training system that integrates algorithm and system innovation. (i) We introduce the MaxK nonlinearity, provide a theoretical analysis of it as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and indices of the feature matrix after the nonlinearity; (ii) we design a coalescing-enhanced forward computation with a row-wise product-based SpGEMM kernel that uses CBSR for input feature matrix fetching and strategically places a sparse output accumulation buffer in shared memory; (iii) we develop an optimized backward computation with an outer product-based SSpMM kernel. We conduct extensive evaluations of MaxK-GNN and report the end-to-end system run-time. Experiments show that the MaxK-GNN system can approach the theoretical speedup limit given by Amdahl's law. We achieve accuracy comparable to SOTA GNNs at significantly higher speed: 3.22×/4.24× speedup (vs. theoretical limits of 5.52×/7.27×) on Reddit compared with the DGL and GNNAdvisor implementations, respectively.
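To make the ideas in the abstract concrete, the following is a minimal pure-Python sketch, not the MaxK-GNN implementation. It assumes that the MaxK nonlinearity keeps the k largest entries of each feature row and zeroes the rest, and that CBSR can be pictured as two dense N×k arrays (values and column indices), so that every row holds exactly k nonzeros. The function names (`maxk`, `to_cbsr`, `from_cbsr`, `amdahl_speedup`), the tie-breaking rule, and the sample numbers are hypothetical and chosen only for illustration.

```python
def topk_indices(row, k):
    """Column indices of the k largest entries, returned in ascending order.
    Ties are broken in favor of the lower column index (an assumption)."""
    order = sorted(range(len(row)), key=lambda j: (-row[j], j))
    return sorted(order[:k])


def maxk(row, k):
    """Assumed MaxK nonlinearity: keep the k largest entries, zero the rest."""
    keep = set(topk_indices(row, k))
    return [v if j in keep else 0.0 for j, v in enumerate(row)]


def to_cbsr(features, k):
    """Pack the MaxK output into a CBSR-like pair of N x k arrays.
    Every row stores exactly k values, so no row-pointer array is needed."""
    data, index = [], []
    for row in features:
        cols = topk_indices(row, k)
        index.append(cols)
        data.append([row[j] for j in cols])
    return data, index


def from_cbsr(data, index, width):
    """Expand the packed form back into dense rows of length `width`."""
    dense = []
    for vals, cols in zip(data, index):
        row = [0.0] * width
        for v, j in zip(vals, cols):
            row[j] = v
        dense.append(row)
    return dense


def amdahl_speedup(p, s):
    """Amdahl's law: overall speedup when a fraction p of run-time is
    accelerated by a factor s. The limit as s grows is 1 / (1 - p)."""
    return 1.0 / ((1.0 - p) + p / s)


# Illustrative inputs only (not taken from the paper).
X = [[0.5, -1.0, 2.0, 0.1],
     [3.0, 0.0, -2.0, 1.0]]
data, index = to_cbsr(X, k=2)
dense = from_cbsr(data, index, width=4)
```

Because each row stores the same number of entries, work per row is uniform in this layout, which is one plausible reading of the "balanced" in CBSR. `amdahl_speedup` shows why the end-to-end gain is bounded: with a hypothetical accelerated fraction p = 0.8, no kernel speedup can push the total beyond 1 / (1 - 0.8) = 5.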
