SuperGCN: General and Scalable Framework for GCN Training on CPU-powered Supercomputers

Chen Zhuang, Peng Chen, Xin Liu, Rio Yokota, Nikoli Dryden, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib
2024-11-25
Abstract: Graph Convolutional Networks (GCNs) are widely used in various domains. However, training distributed full-batch GCNs on large-scale graphs poses challenges due to inefficient memory access patterns and high communication overhead. This paper presents general and efficient aggregation operators designed for irregular memory access patterns. Additionally, we propose a pre-post-aggregation approach and a quantization with label propagation method to reduce communication costs. Combining these techniques, we develop an efficient and scalable distributed GCN training framework, SuperGCN, for CPU-powered supercomputers. Experimental results on multiple large graph datasets show that our method achieves a speedup of up to 6× compared with the SoTA implementations, and scales to 1000s of HPC-grade CPUs, without sacrificing model convergence and accuracy. Our framework achieves performance on CPU-powered supercomputers comparable to that of GPU-powered supercomputers, with a fraction of the cost and power budget.
Distributed, Parallel, and Cluster Computing; Performance
What problem does this paper attempt to address?
This paper addresses the efficiency and scalability challenges of distributed full-batch training of Graph Convolutional Networks (GCNs) on large-scale graphs. Specifically, it targets two issues:

1. **Irregular memory access patterns and load imbalance**: Because graph data is sparse and its connectivity is effectively random, memory access patterns are irregular, which degrades compute performance.
2. **High communication overhead**: After graph partitioning, the boundary nodes between subgraphs must exchange large volumes of data, and this overhead grows on large-scale graphs.

To solve these problems, the paper proposes SuperGCN, an efficient and scalable distributed GCN training framework for CPU-powered supercomputers. Its main contributions are:

- **Efficient graph-related CPU-level operations**: Optimized graph operators for different CPU platforms (ARM and x86) reduce the number of memory accesses, eliminate the inter-thread reduction overhead caused by random connections, and balance load across CPU threads.
- **Hybrid pre-aggregation and post-aggregation method**: A hybrid of pre-aggregation and post-aggregation, based on the minimum vertex cover algorithm, reduces communication volume. It eliminates redundant communication data by rearranging the communication between subgraphs.
- **Low-overhead Int2 quantization method**: A low-overhead Int2 quantization method compresses the transmitted messages. Combined with label propagation, it mitigates the negative impact of quantization, and the paper provides a theoretical convergence guarantee.
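The role of the minimum vertex cover in the hybrid aggregation scheme can be illustrated with a small Python sketch. It rests on assumptions that are not taken from the paper: the cross-partition edges form a bipartite graph from sender-side source nodes to receiver-side destination nodes, and every message costs one feature vector. Under that model, post-aggregation sends one raw feature per distinct source, pre-aggregation sends one partial sum per distinct destination, and a hybrid scheme sends one message per node in a vertex cover of the boundary edges. By König's theorem, the minimum vertex cover of a bipartite graph has the same size as a maximum matching, which is what the sketch computes. The function names are hypothetical and are not part of SuperGCN.

```python
# Illustrative sketch only: the cost model and all names are assumptions,
# not SuperGCN's actual implementation or API.

def max_bipartite_matching(edges):
    """Return the size of a maximum matching (augmenting paths, Kuhn's algorithm)."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    match_of_dst = {}  # destination node -> source node it is matched with

    def try_augment(src, visited):
        for dst in adj[src]:
            if dst in visited:
                continue
            visited.add(dst)
            # Take a free destination, or re-route the source currently using it.
            if dst not in match_of_dst or try_augment(match_of_dst[dst], visited):
                match_of_dst[dst] = src
                return True
        return False

    return sum(try_augment(src, set()) for src in adj)


def communication_volumes(edges):
    """Count feature vectors sent across one partition boundary per strategy."""
    sources = {src for src, _ in edges}
    dests = {dst for _, dst in edges}
    return {
        "post_only": len(sources),  # send every raw source feature
        "pre_only": len(dests),     # send one partial sum per destination
        # Koenig's theorem: |minimum vertex cover| == |maximum matching|.
        "hybrid": max_bipartite_matching(edges),
    }


# Source 0 fans out to three destinations; destination 3 gathers three sources.
boundary_edges = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 3), (3, 3)]
volumes = communication_volumes(boundary_edges)
print(volumes)
```

In this toy boundary, the cover {source 0, destination 3} touches every edge: the raw feature of source 0 is sent once and aggregated on the receiver, while the contributions to destination 3 are summed on the sender and sent as a single partial result.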
Through these improvements, SuperGCN achieves up to a 6× speedup over existing state-of-the-art implementations on multiple large-scale graph datasets while maintaining model convergence and accuracy. In addition, the framework scales to thousands of processors and handles large-scale graphs containing billions of edges and over 100 million nodes.

### Summary of Key Formulas

1. **Quantization and de-quantization formulas**:

\[ h_{quant} = \text{round}\left(\frac{h - Z}{S}\right) \]

\[ h_{dequant} = h_{quant} \times S + Z \]

where \( Z = \min(h) \), \( S = \frac{\max(h) - \min(h)}{2^b - 1} \), and \( b \) is the bit width.

2. **Upper bound of gradient variance**:

\[ K_l = \sum_{i}\left(\sum_{j=1}^{N_R(i)}\sum_{u=1}^{N_R(i)}\alpha_{ji}^2\alpha_{ui}^2 D_j^{l-1} D_u^{l} (S_j^{l-1})^2 (S_u^{l})^2 + M^2 \cdot N_R(i)\sum_{j=1}^{N_R(i)}\alpha_{ji}^2 D_j^{l} (S_j^{l})^2 + N^2 \cdot N_R(i)\sum_{j=1}^{N_R(i)}\alpha_{ji}^2 D_j^{l-1} (S_j^{l-1})^2\right) \]

3. **Unified formula for label propagation and feature propagation**:

\[ H^{(l+1)} = A H^{(l)} W^{(l)} = A^{l+1}(X + Y_{embed})(W^{(0)} W^{(1)} \cdots W^{(l)}) \]

where \( H^{(l)} \) is the embedding after the \( l \)-th GCN layer, \( A \) is the adjacency matrix, \( W^{(l)} \) is the weight matrix of the \( l \)-th GCN layer, \( X \) is the initial feature vector, and \( Y_{embed} \) is the label embedding. The closed form follows from unrolling the recurrence from \( H^{(0)} = X + Y_{embed} \) with nonlinearities omitted.

Together, these methods improve training efficiency while preserving model accuracy.
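The quantization and de-quantization formulas above can be checked with a short round trip in plain Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names `quantize` and `dequantize` are hypothetical, the input is a flat list of floats quantized as one group, and a guard for constant inputs (where \( S \) would be zero) is added for safety. Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding mode used in the framework.

```python
# Illustrative sketch only: names and the constant-input guard are assumptions.

def quantize(values, bits=2):
    """Map floats to integer codes in [0, 2**bits - 1] using Z = min, S = range / levels."""
    zero_point = min(values)                       # Z = min(h)
    levels = (1 << bits) - 1                       # 2^b - 1
    value_range = max(values) - zero_point
    scale = value_range / levels if value_range > 0 else 1.0  # S, guarded against 0
    codes = [round((v - zero_point) / scale) for v in values]
    return codes, zero_point, scale


def dequantize(codes, zero_point, scale):
    """Recover approximate floats: h_dequant = h_quant * S + Z."""
    return [c * scale + zero_point for c in codes]


features = [-1.5, -0.5, 0.25, 1.5]
codes, zero_point, scale = quantize(features, bits=2)
restored = dequantize(codes, zero_point, scale)
max_error = max(abs(a - b) for a, b in zip(features, restored))
print(codes, restored, max_error)
```

With a 2-bit width there are only four representable levels, so the reconstruction error per value can be as large as half the scale \( S \).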