Abstract: Existing distributed graph training frameworks evenly partition a large graph into small chunks for distributed storage, access neighbors through an independent interface, and train Graph Neural Networks (GNNs) on a cluster of machines to update weights. However, they design storage and training separately, which incurs heavy communication costs when retrieving neighborhoods. In the storage phase, traditional heuristic graph partitioning not only suffers from memory overhead because it loads the full graph into memory, but also ignores meaningful node attributes that could guide node or edge assignment by grouping semantically related structures (e.g., user interests and item categories in a recommendation graph). Moreover, in the weight update phase, direct averaging synchronization struggles with heterogeneous local models, because each machine's data is loaded from a different subgraph, and this slows convergence. To solve these problems, we propose a novel distributed graph training approach, \textit{Attribute-driven Streaming Edge Partitioning with Reconciliations} (ASEPR), in which each local model loads only the subgraph stored on its own machine, reducing communication. ASEPR first clusters nodes with similar attributes into the same partition to preserve semantic structure and multi-hop neighbor locality. It then applies streaming partitioning combined with attribute clustering to subgraph assignment to alleviate memory overhead. After local GNN training on the distributed machines, we apply cross-layer reconciliation strategies to the heterogeneous local models, using knowledge distillation and contrastive learning to improve the averaged global model. Extensive experiments on four large graph datasets for node classification and link prediction show that our model outperforms DistDGL with fewer resource requirements and nearly 2x faster convergence.
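The idea of streaming edge partitioning guided by attribute clusters can be illustrated with a minimal sketch. This is not the ASEPR algorithm. It assumes node attributes have already been clustered (the hypothetical `node_cluster` mapping, e.g. produced by k-means over node features), maps each cluster to a hypothetical "home" partition via `cluster_id % num_parts`, and scores each partition with an HDRF-style sum of endpoint locality, attribute affinity, and load balance. The function names and the weights `alpha` and `lam` are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def attribute_streaming_edge_partition(
    edges: Iterable[Tuple[int, int]],
    node_cluster: Dict[int, int],
    num_parts: int,
    alpha: float = 1.0,  # hypothetical weight of the attribute-affinity term
    lam: float = 1.0,    # hypothetical weight of the load-balance term
    eps: float = 1e-9,
) -> Tuple[List[int], List[Set[int]]]:
    """One-pass greedy edge (vertex-cut) partitioner, a sketch only.

    Each edge goes to the partition with the highest score, favouring
    partitions that already hold its endpoints or "own" their attribute
    cluster, while penalising overloaded partitions.
    """
    assignment: List[int] = []
    replicas: List[Set[int]] = [set() for _ in range(num_parts)]  # vertices per partition
    load = [0] * num_parts                                         # edges per partition

    for u, v in edges:
        max_load, min_load = max(load), min(load)
        best_part, best_score = 0, float("-inf")
        for p in range(num_parts):
            # Locality: reuse partitions that already replicate u or v.
            locality = (u in replicas[p]) + (v in replicas[p])
            # Attribute affinity (assumption): each cluster has a fixed home partition.
            affinity = (node_cluster[u] % num_parts == p) + (node_cluster[v] % num_parts == p)
            # Balance: prefer lightly loaded partitions (0 when all loads are equal).
            balance = (max_load - load[p]) / (eps + max_load - min_load)
            score = locality + alpha * affinity + lam * balance
            if score > best_score:
                best_part, best_score = p, score
        assignment.append(best_part)
        replicas[best_part].update((u, v))
        load[best_part] += 1
    return assignment, replicas


def replication_factor(replicas: List[Set[int]], num_nodes: int) -> float:
    """Average number of partitions on which each vertex is replicated."""
    return sum(len(r) for r in replicas) / num_nodes


# Usage: two attribute clusters, each forming a triangle.
node_cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
assignment, replicas = attribute_streaming_edge_partition(edges, node_cluster, num_parts=2)
print(assignment)                          # [0, 0, 0, 1, 1, 1]
print(replication_factor(replicas, 6))     # 1.0
```

Because edges are consumed one at a time, the sketch keeps only per-partition vertex sets and edge counts rather than the full adjacency structure, which is the kind of memory saving streaming partitioning offers. Raising `alpha` trades load balance for tighter grouping of semantically related nodes.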

EC-Graph: A Distributed Graph Neural Network System with Error-Compensated Compression

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Distributed Training of Large Graph Neural Networks with Variable Communication Rates

GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy

GNN at the Edge: Cost-Efficient Graph Neural Network Processing over Distributed Edge Servers

Scalable Graph Compressed Convolutions

Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective

Accurate, Efficient and Scalable Graph Embedding

Scalable and Efficient Full-Graph GNN Training for Large Graphs

CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

Attribute-Driven Streaming Edge Partitioning with Reconciliations for Distributed Graph Neural Networks Training

Graph neural networks meet with distributed graph partitioners and reconciliations

Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training

Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication

SuperGCN: General and Scalable Framework for GCN Training on CPU-powered Supercomputers

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

Distributed Graph Neural Network Training: A Survey