Abstract: Graph Convolutional Networks (GCNs) are widely used in various domains. However, training distributed full-batch GCNs on large-scale graphs poses challenges due to inefficient memory access patterns and high communication overhead. This paper presents general and efficient aggregation operators designed for irregular memory access patterns. Additionally, we propose a pre-post-aggregation approach and a quantization with label propagation method to reduce communication costs. Combining these techniques, we develop an efficient and scalable distributed GCN training framework, \emph{SuperGCN}, for CPU-powered supercomputers. Experimental results on multiple large graph datasets show that our method achieves a speedup of up to 6$\times$ compared with the SoTA implementations, and scales to 1000s of HPC-grade CPUs, without sacrificing model convergence and accuracy. Our framework achieves performance on CPU-powered supercomputers comparable to that of GPU-powered supercomputers, with a fraction of the cost and power budget.

What problem does this paper attempt to address?

The paper targets the efficiency and scalability challenges of distributed full-batch training of Graph Convolutional Networks (GCNs) on large-scale graphs. Specifically, it addresses two issues:

1. **Irregular memory access patterns and load imbalance**: Because graph data is sparse and its connections are effectively random, memory accesses are irregular, which degrades computing performance.
2. **High communication overhead**: After graph partitioning, the boundary nodes of different subgraphs must exchange large amounts of data, and this overhead grows with the scale of the graph.

To solve these problems, the paper proposes SuperGCN, an efficient and scalable distributed GCN training framework designed for CPU-powered supercomputers. Its main contributions are:

- **Efficient graph-related CPU-level operations**: Optimized graph aggregation operators for different CPU platforms (ARM and x86) reduce the number of memory accesses, eliminate the inter-thread reduction overhead caused by random connections, and keep the load balanced among CPU threads.
- **Hybrid pre-aggregation and post-aggregation method**: A hybrid pre-aggregation and post-aggregation method, based on the minimum vertex cover algorithm, reduces the communication volume. It eliminates redundant communication data by rearranging the communication between subgraphs.
- **Low-overhead Int2 quantization method**: A low-overhead Int2 quantization method compresses the transmitted messages. Combined with the label propagation technique, it mitigates the negative impact of quantization, and the paper provides a theoretical convergence guarantee.
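To make the minimum-vertex-cover idea more concrete, here is a minimal sketch in Python. It rests on assumptions that are not stated in the summary: cross-partition edges from one sender partition to one receiver partition form a bipartite graph; a source vertex in the cover is sent once as a raw feature (post-aggregation on the receiver); a destination vertex in the cover receives one partial sum computed on the sender (pre-aggregation); and every message costs one feature vector. The function names and the toy edge list are hypothetical, and the matching routine is the textbook augmenting-path algorithm combined with Kőnig's construction, not necessarily what SuperGCN implements.

```python
def max_matching(adj):
    """Augmenting-path bipartite matching; returns {dest: source}."""
    match_of_dest = {}

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if v not in match_of_dest or try_augment(match_of_dest[v], seen):
                match_of_dest[v] = u
                return True
        return False

    for u in adj:
        try_augment(u, set())
    return match_of_dest


def min_vertex_cover(edges):
    """Return (cover_sources, cover_dests) for bipartite (source, dest) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    match_of_dest = max_matching(adj)
    matched_sources = set(match_of_dest.values())

    # Koenig's construction: alternating search starting at unmatched sources.
    visited_src = {u for u in adj if u not in matched_sources}
    visited_dst = set()
    stack = list(visited_src)
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in visited_dst or match_of_dest.get(v) == u:
                continue  # already reached, or this is a matching edge
            visited_dst.add(v)
            partner = match_of_dest.get(v)  # follow the matching edge back
            if partner is not None and partner not in visited_src:
                visited_src.add(partner)
                stack.append(partner)
    # Cover = unreached sources plus reached destinations.
    return set(adj) - visited_src, visited_dst


# Toy boundary (hypothetical): a1..a3 all feed b1, while a4 feeds b2..b4.
edges = [("a1", "b1"), ("a2", "b1"), ("a3", "b1"),
         ("a4", "b2"), ("a4", "b3"), ("a4", "b4")]
send_raw, send_partial = min_vertex_cover(edges)

pre_only = len({v for _, v in edges})   # one partial sum per remote destination
post_only = len({u for u, _ in edges})  # one raw feature per boundary source
hybrid = len(send_raw) + len(send_partial)
print(send_raw, send_partial, pre_only, post_only, hybrid)
```

In this toy case, pure pre-aggregation and pure post-aggregation each need four messages, while the cover `{a4}` plus `{b1}` needs two: `a4` is sent raw once, and the contributions of `a1`, `a2`, `a3` are summed locally into a single partial result for `b1`.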
With these improvements, experiments on multiple large-scale graph datasets show that SuperGCN achieves a speedup of up to 6$\times$ over state-of-the-art implementations while maintaining model convergence and accuracy. The framework also scales to thousands of processors and handles large-scale graphs with billions of edges and over 100 million nodes.

### Summary of Key Formulas

1. **Quantization and de-quantization formulas**:
   \[ h_{quant} = \text{round}\left(\frac{h - Z}{S}\right) \]
   \[ h_{dequant} = h_{quant} \times S + Z \]
   where $Z = \min(h)$, $S = \frac{\max(h) - \min(h)}{2^b - 1}$, and $b$ is the bit width.
2. **Upper bound of gradient variance**:
   \[ K_l = \sum_{i}\left(\sum_{j=1}^{N_R(i)}\sum_{u=1}^{N_R(i)}\alpha_{ji}^2\alpha_{ui}^2 D_j^{l-1} D_u^{l} (S_j^{l-1})^2 (S_u^{l})^2 + M^2 \cdot N_R(i)\sum_{j=1}^{N_R(i)}\alpha_{ji}^2 D_j^{l} (S_j^{l})^2 + N^2 \cdot N_R(i)\sum_{j=1}^{N_R(i)}\alpha_{ji}^2 D_j^{l-1} (S_j^{l-1})^2\right) \]
3. **Unified formula for label propagation and feature propagation**:
   \[ H^{(l+1)} = A H^{(l)} W^{(l)} = A^{l+1}(X + Y_{embed})(W^{(0)} W^{(1)} \cdots W^{(l)}) \]
   where $H^{(l)}$ is the embedding after the $l$-th GCN layer, $A$ is the adjacency matrix, $W^{(l)}$ is the weight matrix of the $l$-th GCN layer, $X$ is the initial feature vector, and $Y_{embed}$ is the label embedding. The closed form follows from unrolling the recursion with $H^{(0)} = X + Y_{embed}$ and nonlinear activations omitted.

Together, these methods let SuperGCN improve training efficiency while preserving model accuracy.
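The quantization and de-quantization formulas above can be checked with a short pure-Python sketch. The function names `quantize` and `dequantize`, the guard for a constant vector (where $S$ would be zero), and the sample values are assumptions added for illustration; the summary does not say how SuperGCN handles rounding ties or packs the 2-bit codes.

```python
def quantize(h, bits=2):
    """Map floats in h to integer codes in [0, 2**bits - 1]."""
    z = min(h)                      # zero point Z = min(h)
    levels = 2 ** bits - 1
    s = (max(h) - z) / levels       # scale S = (max(h) - min(h)) / (2^b - 1)
    if s == 0:
        s = 1.0                     # assumed guard: constant vector, all codes 0
    # Note: Python's round() resolves exact .5 ties to the nearest even integer.
    codes = [round((x - z) / s) for x in h]
    return codes, z, s


def dequantize(codes, z, s):
    """Reconstruct approximate floats: h_dequant = h_quant * S + Z."""
    return [c * s + z for c in codes]


h = [-1.0, 0.2, 0.9, 2.0]
codes, z, s = quantize(h, bits=2)
h_hat = dequantize(codes, z, s)
max_err = max(abs(a - b) for a, b in zip(h, h_hat))
print(codes, h_hat, max_err)
```

With `bits=2` there are only four representable levels, so each value can be off by as much as $S/2$ after the round trip. The summary attributes the mitigation of that quantization error to the label propagation technique.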

SuperGCN: General and Scalable Framework for GCN Training on CPU-powered Supercomputers
