Abstract: Existing distributed graph training frameworks evenly partition a large graph into small chunks suited to distributed storage, use an independent interface to access neighbors, and train Graph Neural Networks (GNNs) on a cluster of machines to update weights. However, they design storage and training separately, which incurs a large communication cost for retrieving neighborhoods. During the storage phase, traditional heuristic graph partitioning not only suffers from memory overhead because it loads the full graph into memory, but also ignores node attributes that could guide node or edge assignment by grouping semantically related structures (e.g., user interests or item categories in a recommendation graph). Moreover, in the weight update phase, synchronization by direct averaging copes poorly with heterogeneous local models, where each machine's data is loaded from a different subgraph, resulting in slow convergence. To solve these problems, we propose a novel distributed graph training approach, Attribute-driven Streaming Edge Partitioning with Reconciliations (ASEPR), in which each local model loads only the subgraph stored on its own machine, reducing communication. ASEPR first clusters nodes with similar attributes into the same partition to preserve semantic structure and multi-hop neighbor locality. Streaming partitioning combined with attribute clustering is then applied to subgraph assignment to alleviate memory overhead. After the local GNNs are trained on the distributed machines, we deploy cross-layer reconciliation strategies for the heterogeneous local models to improve the averaged global model through knowledge distillation and contrastive learning. Extensive experiments on four large graph datasets, covering node classification and link prediction tasks, show that our model outperforms DistDGL with lower resource requirements and an almost 2x speedup in convergence.
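
The idea of combining streaming edge partitioning with attribute clusters can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a generic greedy heuristic written under stated assumptions. The function name `stream_partition`, the `balance_weight` parameter, the scoring formula, and the modulo mapping from cluster id to partition are all hypothetical, and the attribute clusters are assumed to have been computed beforehand.

```python
from collections import defaultdict


def stream_partition(edges, node_cluster, num_parts, balance_weight=1.0):
    """Greedily assign each streamed edge to one partition (illustrative only).

    edges         -- iterable of (u, v) pairs, consumed in a single pass
    node_cluster  -- dict: node -> attribute cluster id (assumed precomputed)
    num_parts     -- number of partitions / machines
    """
    loads = [0] * num_parts          # edges placed on each partition so far
    replicas = defaultdict(set)      # node -> partitions that already hold it
    assignment = {}

    for u, v in edges:               # only per-node state is kept, not the graph
        max_load = max(loads) or 1   # avoid division by zero on the first edge
        best_part, best_score = 0, float("-inf")
        for p in range(num_parts):
            # Locality: reward partitions that already hold an endpoint.
            locality = (p in replicas[u]) + (p in replicas[v])
            # Attribute affinity (assumption): cluster c prefers partition c % num_parts.
            affinity = (node_cluster[u] % num_parts == p) + (
                node_cluster[v] % num_parts == p
            )
            # Balance: penalise partitions that are already heavily loaded.
            score = locality + affinity - balance_weight * loads[p] / max_load
            if score > best_score:
                best_part, best_score = p, score
        assignment[(u, v)] = best_part
        loads[best_part] += 1
        replicas[u].add(best_part)
        replicas[v].add(best_part)

    return assignment, loads
```

Because each edge is scored against per-partition state only, memory use grows with the number of nodes seen rather than with the full edge set, which is the property the abstract attributes to streaming partitioning. A real system would need a more careful score and a principled cluster-to-partition mapping.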

Geryon: Accelerating Distributed CNN Training by Network-Level Flow Scheduling

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

DyGA: A Hardware-Efficient Accelerator with Traffic-Aware Dynamic Scheduling for Graph Convolutional Networks

SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters

Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation

A Unified CPU-GPU Protocol for GNN Training

ByteGNN: Efficient Graph Neural Network Training at Large Scale

GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism

DynaComm: Accelerating Distributed CNN Training between Edges and Clouds through Dynamic Communication Scheduling

Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

BatchGNN: Efficient CPU-Based Distributed GNN Training on Very Large Graphs

Priority-based Parameter Propagation for Distributed DNN Training

Slicing Input Features to Accelerate Deep Learning: A Case Study with Graph Neural Networks

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

Scaling Deep Learning on GPU and Knights Landing clusters

GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism

Attribute-Driven Streaming Edge Partitioning with Reconciliations for Distributed Graph Neural Networks Training

Fast and accurate variable batch size convolution neural network training on large scale distributed systems