Abstract: Graph neural networks (GNNs) are a type of deep learning model trained on graphs and have been successfully applied in various domains. Despite their effectiveness, it remains challenging for GNNs to scale efficiently to large graphs. As a remedy, distributed computing is a promising solution for training large-scale GNNs, since it can provide abundant computing resources. However, the dependencies imposed by the graph structure make it difficult to achieve high-efficiency distributed GNN training, which suffers from massive communication and workload imbalance. In recent years, many efforts have been made on distributed GNN training, and an array of training algorithms and systems have been proposed. Yet there is a lack of systematic review of the optimization techniques for the distributed execution of GNN training. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, the loss of model accuracy, and workload imbalance. Then we introduce a new taxonomy for the optimization techniques in distributed GNN training that address these challenges. The new taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. Finally, we summarize existing distributed GNN systems for multi-GPU, GPU-cluster, and CPU-cluster settings, respectively, and discuss future directions for distributed GNN training.

What problem does this paper attempt to address?

This paper addresses the challenges of training large-scale graph neural networks (GNNs). Although GNNs have achieved remarkable success in multiple fields, their computational resource requirements on large-scale graph data are extremely high, which makes it difficult for existing methods to scale efficiently. Distributed computing is a potential solution because it provides a large amount of computational resources for training large-scale GNNs. However, because of the data dependencies in the graph structure, efficient distributed GNN training faces three main challenges:

1. **Massive Feature Communication**: Distributed GNN training requires a large amount of feature communication, both in the batch generation stage and in the GNN model training stage. In full-graph training in particular, the graph aggregation at each layer needs to access the hidden features of remote nodes, which leads to a large volume of hidden feature (or embedding) communication.
2. **Loss of Model Accuracy**: Batch training is more scalable than full-graph training, but as the model depth increases, exact batch training runs into the neighbor explosion problem. To improve training efficiency, approximate batch training methods are usually adopted, such as sampling neighbors or ignoring edges that cross worker nodes. However, these methods cannot guarantee the theoretical convergence of the model, so a trade-off must be made between model accuracy and training efficiency.
3. **Workload Imbalance**: Workload balancing is an inherent problem in distributed computing, and the workload characteristics of GNN models make it harder to balance the training workload among worker nodes. Traditional graph partitioning algorithms cannot be applied directly, because GNN workloads are difficult to model in a simple and unified way. In addition, in distributed batch GNN training, balance means that every worker node processes the same number of batches of the same size, which is a stronger requirement than balancing the number of vertices in each subgraph.

Based on this analysis, the paper proposes a new taxonomy that divides existing optimization techniques into four major categories: GNN data partitioning, GNN batch generation, GNN execution model, and GNN communication protocol. The techniques in each category aim to solve one or more of the challenges above, thereby improving the efficiency and effectiveness of distributed GNN training.
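To make the three challenges more concrete, the following is a minimal Python sketch on a toy random graph. It is not taken from the survey: the random graph, the modulo (hash) partitioning, the fan-out sampler, and all function names are illustrative assumptions. It counts how many nodes a small batch needs per layer with and without neighbor sampling (neighbor explosion), and how many neighbor reads cross workers under a naive partition (feature communication), alongside the per-worker vertex and edge counts (workload balance).

```python
import random
from collections import defaultdict


def make_random_graph(num_nodes, avg_degree, seed=0):
    """Build an undirected random graph as {node: sorted neighbor list} (toy data)."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    remaining = num_nodes * avg_degree // 2
    while remaining > 0:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            remaining -= 1
    return {u: sorted(adj[u]) for u in range(num_nodes)}


def receptive_field_sizes(adj, seeds, num_layers, fanout=None, seed=0):
    """Number of distinct nodes a batch needs after each layer.

    fanout=None expands all neighbors (exact batch training);
    otherwise at most `fanout` neighbors are sampled per node.
    """
    rng = random.Random(seed)
    needed = set(seeds)
    frontier = set(seeds)
    sizes = [len(needed)]
    for _ in range(num_layers):
        reached = set()
        for u in frontier:
            nbrs = adj[u]
            if fanout is not None and len(nbrs) > fanout:
                nbrs = rng.sample(nbrs, fanout)
            reached.update(nbrs)
        frontier = reached - needed  # only expand newly discovered nodes
        needed |= reached
        sizes.append(len(needed))
    return sizes


def partition_stats(adj, num_workers):
    """Assign node u to worker u % num_workers (an assumed, naive scheme).

    Returns per-worker vertex counts, per-worker neighbor reads
    (a rough proxy for aggregation work), and the number of reads
    whose neighbor lives on a different worker.
    """
    vertices = [0] * num_workers
    reads = [0] * num_workers
    remote_reads = 0
    for u, nbrs in adj.items():
        wu = u % num_workers
        vertices[wu] += 1
        for v in nbrs:
            reads[wu] += 1
            if v % num_workers != wu:
                remote_reads += 1  # feature of v must be fetched remotely
    return vertices, reads, remote_reads


graph = make_random_graph(num_nodes=2000, avg_degree=10)
batch = list(range(8))

exact = receptive_field_sizes(graph, batch, num_layers=3)
sampled = receptive_field_sizes(graph, batch, num_layers=3, fanout=3)
vertices, reads, remote_reads = partition_stats(graph, num_workers=4)

print("nodes needed per layer (exact):  ", exact)
print("nodes needed per layer (sampled):", sampled)
print("vertices per worker:", vertices)
print("neighbor reads per worker:", reads)
print("remote reads:", remote_reads, "of", sum(reads))
```

In this sketch the vertex counts per worker are identical by construction, while the neighbor reads per worker generally are not, which mirrors the point that balancing vertices alone does not balance GNN work. Sampling shrinks the per-layer node set at the cost of an approximate aggregation, which is the accuracy versus efficiency trade-off described above.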

Distributed Graph Neural Network Training: A Survey

A Comprehensive Survey on Distributed Training of Graph Neural Networks

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Scalable and Efficient Full-Graph GNN Training for Large Graphs

Distributed Training of Large Graph Neural Networks with Variable Communication Rates

NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism

The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey

Fully Distributed Online Training of Graph Neural Networks in Networked Systems

Acceleration Algorithms in GNNs: A Survey

Graph neural networks meet with distributed graph partitioners and reconciliations

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks

GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy

POSTER: ParGNN: Efficient Training for Large-Scale Graph Neural Network on GPU Clusters

Scalable Neural Network Training over Distributed Graphs

A Survey of Distributed Graph Algorithms on Massive Graphs

A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking