Abstract: Graph neural networks (GNNs) are one of the most rapidly growing fields within deep learning. While many distributed GNN training frameworks have been proposed to increase training throughput, they face three limitations when applied to multi-server clusters. 1) They suffer from an inter-server communication bottleneck because they do not consider the gap between inter-server and intra-server bandwidth, a representative characteristic of multi-server clusters. 2) Redundant memory usage and computation hinder the scalability of the distributed frameworks. 3) Sampling methods, the de facto standard in mini-batch training, incur unnecessary errors in multi-server clusters. We found that these limitations can be addressed by exploiting the characteristics of multi-server clusters. Here, we propose GraNNDis, a fast distributed GNN training framework for multi-server clusters. First, we present Flexible Preloading, which preloads the essential vertex dependencies server-wise to reduce low-bandwidth inter-server communication. Second, we introduce Cooperative Batching, which enables memory-efficient, less redundant mini-batch training by utilizing high-bandwidth intra-server communication. Third, we propose Expansion-aware Sampling, a cluster-aware sampling method that samples only the edges that affect the system speedup. Because intra-server dependencies are communicated through fast intra-server links, sampling them contributes little to the speedup, so the method samples only the dependencies that cross a server boundary. Lastly, we introduce One-Hop Graph Masking, a computation and communication structure that realizes the above methods in multi-server environments. We evaluated GraNNDis on multi-server clusters, and it provided significant speedup over the state-of-the-art distributed GNN training frameworks. 
GraNNDis is open-sourced at https://github.com/AIS-SNU/GraNNDis_Artifact to facilitate its use.
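
The idea behind Expansion-aware Sampling, as the abstract describes it, can be illustrated with a minimal sketch. This is not the GraNNDis implementation: the function name `sample_boundary_edges`, the `server_of` vertex-to-server mapping, and the uniform `keep_prob` sampling rule are all assumptions made for illustration. The sketch keeps every intra-server edge and randomly drops only the edges that cross a server boundary.

```python
import random


def sample_boundary_edges(edges, server_of, keep_prob, seed=0):
    """Hypothetical sketch: sample only edges that cross a server boundary.

    edges     -- list of (src, dst) vertex pairs
    server_of -- dict mapping each vertex to the server that owns it (assumed)
    keep_prob -- probability of keeping an inter-server edge (assumed uniform)
    """
    rng = random.Random(seed)
    kept = []
    for src, dst in edges:
        if server_of[src] == server_of[dst]:
            # Intra-server dependency: travels over fast links, always kept.
            kept.append((src, dst))
        elif rng.random() < keep_prob:
            # Inter-server dependency: kept only with probability keep_prob.
            kept.append((src, dst))
    return kept


# Toy graph: vertices 0 and 1 live on server 0, vertices 2 and 3 on server 1.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
server_of = {0: 0, 1: 0, 2: 1, 3: 1}

intra_only = sample_boundary_edges(edges, server_of, keep_prob=0.0)
everything = sample_boundary_edges(edges, server_of, keep_prob=1.0)
```

With `keep_prob=0.0` only the two intra-server edges survive, and with `keep_prob=1.0` the graph is unchanged; values in between trade inter-server traffic against sampling error.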

Attribute-Driven Streaming Edge Partitioning with Reconciliations for Distributed Graph Neural Networks Training

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

Graph neural networks meet with distributed graph partitioners and reconciliations

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

An Experimental Comparison of Partitioning Strategies for Distributed Graph Neural Network Training

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Entropy Aware Training for Fast and Accurate Distributed GNN

A Graph Neural Network Based Decentralized Learning Scheme

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication

Distributed Graph Neural Network Training: A Survey

CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks

Leiden-Fusion Partitioning Method for Effective Distributed Training of Graph Embeddings

Scalable and Efficient Full-Graph GNN Training for Large Graphs

GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs on Large Clusters

Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training

MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs

D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks

GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy