Abstract: Distributed training of GNNs enables learning on massive graphs (e.g., social and e-commerce networks) that exceed the storage and computational capacity of a single machine. To reach performance comparable to centralized training, distributed frameworks focus on maximally recovering cross-instance node dependencies, either through communication across instances or through periodic fallback to centralized training; both create overhead and limit the scalability of the framework. In this work, we present a simplified framework for distributed GNN training that does not rely on these costly operations and that improves scalability, convergence speed, and performance over state-of-the-art approaches. Specifically, our framework (1) assembles independent trainers, each of which asynchronously learns a local model on the locally available parts of the training graph, and (2) conducts only periodic (time-based) model aggregation to synchronize the local models. Maximizing the recovery of cross-instance node dependencies has been considered the key to closing the performance gap between model aggregation and centralized training. Backed by our theoretical analysis, our framework instead uses randomized assignment of nodes or super-nodes (i.e., collections of original nodes) to partition the training graph, which improves data uniformity and minimizes the discrepancy of gradients and loss functions across instances. In our experiments on social and e-commerce networks with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches, despite using less training data, achieve state-of-the-art performance and a 2.31x speedup compared to the fastest baseline, and show better robustness to trainer failures.
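
The two ingredients named in the abstract, randomized node assignment and model aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`random_partition`, `local_edges`, `aggregate`) are hypothetical, the aggregation is assumed to be a plain element-wise mean of parameter vectors, and dropping cross-partition edges is assumed as the reason each trainer sees less training data. The time-based scheduling of aggregation and the super-node variant are left out.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def random_partition(num_nodes: int, num_trainers: int, seed: int = 0) -> List[List[int]]:
    """Assign every node to exactly one trainer, uniformly at random."""
    rng = random.Random(seed)
    parts: List[List[int]] = [[] for _ in range(num_trainers)]
    for node in range(num_nodes):
        parts[rng.randrange(num_trainers)].append(node)
    return parts


def local_edges(edges: Sequence[Edge], part: Sequence[int]) -> List[Edge]:
    """Keep only edges with both endpoints in this partition.

    Assumption: cross-partition edges are simply dropped, so no
    communication between trainers is needed during local training.
    """
    local = set(part)
    return [(u, v) for u, v in edges if u in local and v in local]


def aggregate(models: Sequence[Sequence[float]]) -> List[float]:
    """Average the trainers' parameter vectors element-wise (assumed plain mean)."""
    n = len(models)
    return [sum(weights) / n for weights in zip(*models)]


if __name__ == "__main__":
    # Toy ring graph with 12 nodes, split across 3 hypothetical trainers.
    edges = [(i, (i + 1) % 12) for i in range(12)]
    parts = random_partition(num_nodes=12, num_trainers=3, seed=42)
    for trainer_id, part in enumerate(parts):
        print(trainer_id, sorted(part), local_edges(edges, part))
    # Stand-in parameter vectors, one per trainer, averaged at a sync point.
    print(aggregate([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))
```

In a real system each trainer would run its own optimizer on its local subgraph and the averaging step would be triggered on a timer rather than called by hand.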

Bbtopk: Bandwidth-Aware Sparse Allreduce with Blocked Sparsification for Efficient Distributed Training

Near-Optimal Sparse Allreduce for Distributed Deep Learning

SparDL: Distributed Deep Learning Training with Efficient Sparse Communication

S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning

An Efficient Bandwidth-Adaptive Gradient Compression Algorithm for Distributed Training of Deep Neural Networks

Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Bandwidth Reduction using Importance Weighted Pruning on Ring AllReduce

Sparse Communication for Training Deep Networks

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

Peering Beyond the Gradient Veil with Distributed Auto Differentiation

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

Sparse Gradient Compression for Distributed SGD

A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration

Is Network the Bottleneck of Distributed Training?

Efficient Neural Network Training Via Forward and Backward Propagation Sparsification

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

Distributed Deep Neural Network Training with Important Gradient Filtering, Delayed Update and Static Filtering

From promise to practice: realizing high-performance decentralized training

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks