Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.

What problem does this paper attempt to address?

The main problem this paper addresses is the scalability of machine learning over large-scale graph data (for example, graphs with more than 1 billion nodes). Specifically, it focuses on the bottlenecks that Graph Neural Networks (GNNs) and sampling-based node representation learning methods face when processing large graphs.

### Main Problems

1. **Scalability Issues in GNN Training**:
   - In supervised learning, the key bottleneck in GNN training is the sampling phase of mini-batch training. For each vertex in a mini-batch, the sampler must sample its neighbor nodes (and their corresponding edges) and fetch their features. Because this involves many random data accesses and remote feature fetches, especially when a large graph is trained in a distributed manner, sampling may take more time than the training phase itself.
   - High communication cost: In distributed training, frequent data accesses across workers lead to high communication overhead and low computational efficiency.
2. **Scalability Issues in Node Embedding Training**:
   - For unsupervised learning, the main challenge in training node embeddings is model size. For example, for a graph with 1 billion nodes and an embedding dimension of 256, the embedding matrix alone requires 512 GB of memory (using 16-bit floating-point numbers). Traditional distributed training methods rely on data parallelism, which requires storing the entire model in memory and a large amount of communication when averaging gradients after each iteration.
   - Storage bottleneck: Storing the entire embedding matrix requires a large amount of memory.
   - Communication bottleneck: A large amount of communication is required when averaging gradients in each iteration.
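The 512 GB figure for the embedding matrix follows from simple arithmetic, which the short Python sketch below reproduces. The helper names and the count of 16 storage workers are assumptions made purely for illustration; they do not come from the paper.

```python
GB = 10**9  # decimal gigabyte, matching the 512 GB figure in the summary


def embedding_table_bytes(num_nodes: int, dim: int, bytes_per_value: int) -> int:
    """Bytes needed for a dense num_nodes x dim embedding matrix."""
    return num_nodes * dim * bytes_per_value


def shard_gb(total_bytes: int, num_storage_workers: int) -> float:
    """Per-worker size in GB if the table is split evenly across workers.

    num_storage_workers is a hypothetical parameter for this example.
    """
    return total_bytes / num_storage_workers / GB


# 1 billion nodes, 256 dimensions, 2 bytes per value (16-bit floats).
total = embedding_table_bytes(1_000_000_000, 256, 2)
print(total / GB)           # 512.0
print(shard_gb(total, 16))  # 32.0 (assumed 16 storage workers)
```

Under these assumptions, splitting the table across storage workers brings the per-worker footprint down to a size that fits in ordinary machine memory, which is the motivation for not replicating the full model on every worker.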
### Solutions

To address these challenges, the paper proposes the GraphScale framework, which tackles them in the following ways:

- **Separate Computation and Storage**: By separating the workers that store data from those that perform training, GraphScale reduces the number of feature requests and allows feature fetching and computation to overlap, thereby alleviating the communication bottleneck.
- **Hybrid Parallelism**: GraphScale combines data parallelism and model parallelism to train node embeddings, so that compute and storage capacity can be scaled independently and communication and storage requirements are reduced.
- **Serverless Architecture**: Built on the Ray framework, GraphScale provides elastic resource allocation and fault tolerance, simplifying resource management.

### Experimental Results

Experiments show that, when processing graphs with billions of nodes, GraphScale reduces end-to-end training time by at least 40% compared to existing distributed training frameworks (such as DistDGL and GraphLearn), without any loss in performance. In addition, GraphScale has been deployed in production at TikTok, where it efficiently handles large-scale graph data with billions of nodes.

In conclusion, GraphScale significantly improves the scalability and efficiency of machine learning over large-scale graph data by optimizing feature fetching, introducing hybrid parallelism, and adopting a serverless architecture.
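The idea of overlapping feature fetching with computation can be illustrated with a minimal single-process sketch. This is not GraphScale's implementation, which the summary says is built on Ray; here a background thread stands in for the storage workers and a bounded queue acts as the prefetch buffer. The names `run_pipeline`, `fake_fetch`, `fake_train`, and the `depth` parameter are all hypothetical.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the batch stream


def run_pipeline(
    batches: Iterable[Any],
    fetch: Callable[[Any], Any],
    train: Callable[[Any], Any],
    depth: int = 2,
) -> List[Any]:
    """Run fetch(batch) in the background while train(features) runs here.

    depth bounds how many fetched batches may wait in the buffer.
    """
    ready: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        # Stand-in for the storage side: fetch features ahead of the trainer.
        try:
            for batch in batches:
                ready.put(fetch(batch))
        finally:
            ready.put(_DONE)  # always signal completion so the consumer stops

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = ready.get()
        if item is _DONE:
            break
        results.append(train(item))  # compute overlaps with the next fetch
    worker.join()
    return results


def fake_fetch(node_ids: List[int]) -> List[float]:
    """Hypothetical feature lookup: one float 'feature' per node id."""
    return [float(i) for i in node_ids]


def fake_train(features: List[float]) -> float:
    """Hypothetical training step: reduces the features to a single number."""
    return sum(features)


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5]], fake_fetch, fake_train))
    # [3.0, 7.0, 5.0]
```

Because the queue is bounded, the fetching side can run at most `depth` batches ahead of the trainer, which keeps memory use predictable while still hiding fetch latency behind computation.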

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

ByteGNN: Efficient Graph Neural Network Training at Large Scale

Scalable and Efficient Full-Graph GNN Training for Large Graphs

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Efficient scaling of dynamic graph neural networks

GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy

GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism

CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks

HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture

Accurate, Efficient and Scalable Graph Embedding

A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking

Scaling Up Graph Neural Networks Via Graph Coarsening

Towards Efficient Large-Scale Graph Neural Network Computing.

Efficient Graph Neural Network Inference at Large Scale

Distributed Matrix-Based Sampling for Graph Neural Network Training

MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs

Haste Makes Waste: A Simple Approach for Scaling Graph Neural Networks

Graph Batch Coarsening Framework for Scalable Graph Neural Networks

Sketch-GNN: Scalable Graph Neural Networks with Sublinear Training Complexity

ScaleNet: Scale Invariance Learning in Directed Graphs

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs