Yingxia Shao, Hongzheng Li, Xizhi Gu, Hongbo Yin, Yawen Li, Xupeng Miao, Wentao Zhang, Bin Cui, Lei Chen
Abstract: Graph neural networks (GNNs) are a class of deep learning models that are trained on graphs and have been applied successfully in various domains. Despite their effectiveness, it remains challenging to scale GNNs efficiently to large graphs. As a remedy, distributed computing is a promising solution for training large-scale GNNs, since it provides abundant computing resources. However, the dependencies imposed by the graph structure make it difficult to achieve highly efficient distributed GNN training, which suffers from massive communication and workload imbalance. In recent years, many efforts have been devoted to distributed GNN training, and an array of training algorithms and systems have been proposed. Yet there is no systematic review of the optimization techniques for the distributed execution of GNN training. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, the loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques in distributed GNN training that address these challenges. The taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. Finally, we summarize existing distributed GNN systems for multi-GPU machines, GPU clusters, and CPU clusters, respectively, and discuss future directions for distributed GNN training.
Machine Learning; Artificial Intelligence; Databases; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the challenges of training large-scale graph neural networks (GNNs). Although GNNs have achieved remarkable success in many fields, training them on large graphs demands so much computation and memory that existing methods struggle to scale efficiently. Distributed computing is a promising remedy because it supplies abundant computing resources for large-scale GNN training. However, the data dependencies imposed by the graph structure make efficient distributed GNN training difficult. The main challenges are:
1. **Massive Feature Communication**: Distributed GNN training requires a large amount of feature communication, both in the batch generation stage and in the model training stage. In full-graph training in particular, the graph aggregation at every layer must read the hidden features of vertices stored on remote workers, which produces a large volume of hidden-feature (embedding) traffic.
2. **Loss of Model Accuracy**: Mini-batch training scales better than full-graph training, but exact mini-batch training runs into the neighbor explosion problem as the model gets deeper, because the set of multi-hop neighbors a batch depends on grows rapidly with the number of layers. To keep training efficient, approximate mini-batch methods are usually adopted, such as neighbor sampling or ignoring edges that cross workers. These approximations cannot guarantee the theoretical convergence of the model, so model accuracy must be traded off against training efficiency.
3. **Workload Imbalance**: Workload balance is an inherent problem in distributed computing, and the workload characteristics of GNN models make it harder to balance training across workers. Traditional graph partitioning algorithms cannot be applied directly, because GNN workloads are difficult to model in a simple and unified way. In addition, in distributed mini-batch training every worker should process the same number of batches of the same size, so balancing only the number of vertices per subgraph is not sufficient.
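The three challenges above can be made concrete on a toy graph. The following is a minimal sketch, not taken from the paper: the eight-vertex graph, the two-worker vertex assignment `owner`, and all function names are hypothetical and chosen only for illustration. It counts the remote vertices each worker must fetch for one aggregation layer, the growth of one seed vertex's multi-hop neighborhood, and the per-worker aggregation load under a partition that balances vertex counts.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected adjacency sets from a list of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def remote_fetches_per_layer(adj, owner):
    """Distinct remote vertices whose features each worker needs
    for a single full-graph aggregation layer."""
    needed = defaultdict(set)
    for v, neighbors in adj.items():
        for u in neighbors:
            if owner[u] != owner[v]:
                needed[owner[v]].add(u)
    return {worker: len(vertices) for worker, vertices in needed.items()}


def k_hop_sizes(adj, seed, num_layers):
    """Number of vertices within 0, 1, ..., num_layers hops of the seed."""
    visited = {seed}
    frontier = {seed}
    sizes = [1]
    for _ in range(num_layers):
        frontier = {u for v in frontier for u in adj[v]} - visited
        visited |= frontier
        sizes.append(len(visited))
    return sizes


def aggregation_load(adj, owner):
    """Proxy for per-worker work: neighbor entries aggregated per layer."""
    load = defaultdict(int)
    for v, neighbors in adj.items():
        load[owner[v]] += len(neighbors)
    return dict(load)


# Hypothetical toy graph: vertex 0 is a hub, vertices 5-6-7 form a chain.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (5, 6), (6, 7)]
adj = build_adjacency(edges)

# Hypothetical partition: four vertices on each of two workers.
owner = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

remote = remote_fetches_per_layer(adj, owner)   # {0: 2, 1: 1}
hops = k_hop_sizes(adj, seed=1, num_layers=3)   # [1, 2, 6, 7]
load = aggregation_load(adj, owner)             # {0: 8, 1: 6}

print("remote vertices fetched per layer:", remote)
print("receptive field of vertex 1 by hop:", hops)
print("aggregation load per worker:", load)
```

Even though both workers own four vertices, worker 0 holds the hub and therefore aggregates more neighbor entries, and a two-layer model seeded at vertex 1 already touches six of the eight vertices. On real graphs with skewed degree distributions, these effects are far more pronounced.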
Based on this analysis, the paper proposes a new taxonomy that divides existing optimization techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. The techniques in each category address one or more of the challenges above and thereby improve the efficiency and effectiveness of distributed GNN training.