Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely used in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning that stores and processes large graph data in a distributed manner. The key insight in our design is the separation of workers that store data from those that perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale on both public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods do not support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok, enabling efficient learning over such large graphs.
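The idea of overlapping data fetching with computation can be illustrated with a small producer-consumer sketch. This is not GraphScale's implementation: it is a single-process analogy in which a thread stands in for a storage worker and the main thread stands in for a trainer. The names `storage_worker`, `trainer`, `run_pipeline`, and `prefetch_depth`, the fake feature values, and the sleep-based delays are all assumptions made for illustration.

```python
import queue
import threading
import time


def storage_worker(batch_ids, out_queue, fetch_delay=0.01):
    """Hypothetical storage-side worker: fetches features per mini-batch."""
    for batch_id in batch_ids:
        time.sleep(fetch_delay)  # stand-in for a remote feature lookup
        features = [batch_id * 10 + i for i in range(4)]  # fake features
        out_queue.put((batch_id, features))  # blocks when the queue is full
    out_queue.put(None)  # sentinel: no more batches


def trainer(in_queue, compute_delay=0.01):
    """Hypothetical training-side worker: consumes prefetched batches."""
    results = []
    while True:
        item = in_queue.get()
        if item is None:
            break
        batch_id, features = item
        time.sleep(compute_delay)  # stand-in for a forward/backward pass
        results.append((batch_id, sum(features)))
    return results


def run_pipeline(num_batches=8, prefetch_depth=2):
    """Run fetching and training concurrently through a bounded queue."""
    q = queue.Queue(maxsize=prefetch_depth)
    producer = threading.Thread(
        target=storage_worker, args=(range(num_batches), q)
    )
    producer.start()
    results = trainer(q)  # runs while the producer keeps fetching ahead
    producer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(num_batches=4))
```

The bounded queue is the design choice that matters here: the fetcher can run at most `prefetch_depth` batches ahead of the trainer, so the memory held by prefetched batches stays capped while the trainer rarely waits for data.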

Graph Attention Neural Network Distributed Model Training

Efficient Large-Scale Language Model Training on GPU Clusters

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

Distributed Training of Large Graph Neural Networks with Variable Communication Rates

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Towards Training Billion Parameter Graph Neural Networks for Atomic Simulations

Distributed Graph Neural Network Training: A Survey

Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Sampling-based Distributed Training with Message Passing Neural Network

Graph Attention MLP with Reliable Label Utilization

Towards Efficient Large-Scale Graph Neural Network Computing

CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks

Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks

GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy

Distributed Matrix-Based Sampling for Graph Neural Network Training

Efficient scaling of dynamic graph neural networks