Abstract: Graph neural networks (GNNs) are a class of deep learning models that are trained on graphs and have been applied successfully in various domains. Despite their effectiveness, it remains challenging for GNNs to scale efficiently to large graphs. Distributed computing is a promising solution for training large-scale GNNs, since it provides abundant computing resources. However, the dependencies in the graph structure make it difficult to achieve efficient distributed GNN training, which suffers from massive communication and workload imbalance. In recent years, many efforts have been made on distributed GNN training, and an array of training algorithms and systems have been proposed. Yet a systematic review of the optimization techniques for the distributed execution of GNN training is lacking. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques in distributed GNN training that address these challenges. The taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. Finally, we summarize existing distributed GNN systems for multi-GPU, GPU-cluster, and CPU-cluster settings, and discuss future directions for distributed GNN training.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "Distributed Graph Neural Network Training: A Survey" addresses the challenges of training large-scale graph neural networks (GNNs) in a distributed setting. Although GNNs perform well on graph-structured data, traditional single-machine training runs into memory and compute bottlenecks on large graphs. Distributed computing is therefore a promising solution, since it provides abundant computing resources. However, the dependencies within graph data make efficient distributed GNN training difficult, mainly in the following three respects:

1. **Massive feature communication**:
   - Both mini-batch and full-graph distributed training require frequent feature communication. In mini-batch training, generating a batch requires access to remote graph structure and features. In full-graph training, the aggregation operation at each layer needs the hidden features of remote neighbors.
2. **Loss of model accuracy**:
   - Mini-batch GNN training is more scalable, but it faces the neighbor explosion problem as the model gets deeper. The usual remedy is to construct an approximate mini-batch by sampling or by ignoring edges that cross worker nodes, which may not guarantee convergence of the model. A trade-off between model accuracy and training efficiency is therefore required. Full-graph distributed GNN training has gradually attracted attention because it can ensure convergence and achieve the same model accuracy as single-machine training.
3. **Workload imbalance**:
   - Workload balance is an inherent problem in distributed computing. Because GNN models have varied workload characteristics, balancing the training workload among worker nodes is harder than usual, and classic graph partitioning algorithms are difficult to apply directly. In addition, mini-batch distributed GNN training requires each worker node to process the same number of mini-batches, each of the same size, which makes balancing even harder.

To address these challenges, the paper proposes a new taxonomy that divides existing optimization techniques into four categories: GNN data partitioning, GNN batch generation, GNN execution model, and GNN communication protocol. Using this taxonomy, the paper systematically reviews the optimization techniques and discusses their applications and effects at different stages of training. Finally, it summarizes existing distributed GNN systems and outlines future research directions.
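The three challenges above can be made concrete on a toy graph. The following is a minimal sketch and is not taken from the paper: the edge list, the two-worker partition, and all function names are hypothetical, chosen only for illustration. It counts the remote features each worker must fetch for one aggregation layer, the number of neighbor messages each worker aggregates, and how the multi-hop neighborhood of a single seed vertex grows with model depth.

```python
# Hypothetical toy graph: undirected edges over 8 vertices (illustrative only).
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3),
         (3, 4), (4, 5), (4, 6), (5, 6), (6, 7)]

# Hypothetical partition mapping each vertex to a worker id.
# Both workers own exactly 4 vertices.
PARTITION = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}


def build_adjacency(edges):
    """Build an undirected adjacency map: vertex -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def remote_feature_fetches(adj, partition):
    """Per worker, count the distinct remote vertices whose features
    are needed by one layer of neighbor aggregation."""
    remote = {}
    for v, nbrs in adj.items():
        for n in nbrs:
            if partition[n] != partition[v]:
                remote.setdefault(partition[v], set()).add(n)
    return {worker: len(vertices) for worker, vertices in remote.items()}


def aggregation_workload(adj, partition):
    """Per worker, count the neighbor messages aggregated in one layer,
    i.e. the sum of the degrees of the vertices it owns."""
    load = {}
    for v, nbrs in adj.items():
        load[partition[v]] = load.get(partition[v], 0) + len(nbrs)
    return load


def receptive_field_sizes(adj, seed, num_layers):
    """Number of vertices within k hops of `seed`, for k = 0..num_layers.
    A k-layer GNN needs all of them to compute the seed's output."""
    seen = {seed}
    frontier = {seed}
    sizes = [1]
    for _ in range(num_layers):
        frontier = {n for v in frontier for n in adj[v]} - seen
        seen |= frontier
        sizes.append(len(seen))
    return sizes


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    print("remote fetches per worker:", remote_feature_fetches(adj, PARTITION))
    print("aggregation load per worker:", aggregation_workload(adj, PARTITION))
    print("receptive field of vertex 0:", receptive_field_sizes(adj, 0, 3))
```

In this sketch both workers own the same number of vertices, yet the aggregation load differs between them (11 versus 9 messages), which hints at why balancing vertex counts alone does not balance GNN workload. Real graphs are far larger and more skewed, so these numbers only illustrate the definitions and say nothing about actual system behavior.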

Distributed Graph Neural Network Training: A Survey

A Comprehensive Survey on Distributed Training of Graph Neural Networks

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Scalable and Efficient Full-Graph GNN Training for Large Graphs

Distributed Training of Large Graph Neural Networks with Variable Communication Rates

The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey

Fully Distributed Online Training of Graph Neural Networks in Networked Systems

Acceleration Algorithms in GNNs: A Survey

Graph neural networks meet with distributed graph partitioners and reconciliations

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks

GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy

Scalable Neural Network Training over Distributed Graphs

A Survey of Distributed Graph Algorithms on Massive Graphs

A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking