Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency.
We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.
Machine Learning,Distributed, Parallel, and Cluster Computing,Social and Information Networks
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is the scalability challenges of large - scale graph data (for example, graphs containing more than 1 billion nodes) in machine learning. Specifically, the paper focuses on the bottleneck problems faced by Graph Neural Networks (GNNs) and sampling - based node representation learning methods when processing large graphs.
### Main Problems
1. **Scalability Issues in GNN Training**:
- In supervised learning, there is a key bottleneck in the GNN training process: the sampling phase of mini - batch training. For each vertex in a mini - batch, the sampler needs to sample from its neighbor nodes (and their corresponding edges) and obtain features. Due to a large number of random data accesses and remote feature acquisitions, especially when training large graphs in a distributed manner, this process may take more time than the training phase.
- High communication cost: In distributed training, frequent cross - worker - node data accesses lead to high communication overhead and low computational efficiency.
2. **Scalability Issues in Node Embedding Training**:
- For unsupervised learning, the main challenge in training node embeddings is the problem of model size. For example, for a graph containing 1 billion nodes with each node having an embedding dimension of 256, the embedding matrix alone requires 512 GB of memory (using 16 - bit precision floating - point numbers). Traditional distributed training methods rely on data parallelism, which requires storing the entire model in memory and a large amount of communication when averaging gradients after each iteration.
- Storage bottleneck: Storing the entire embedding matrix requires a large amount of memory.
- Communication bottleneck: A large amount of communication is required when averaging gradients in each iteration.
### Solutions
To address the above challenges, the paper proposes the GraphScale framework, which aims to solve these problems in the following ways:
- **Separate Computation and Storage**: By separating the worker nodes that store data from those that perform training, the number of feature requests is reduced, and feature acquisition and computation are allowed to overlap, thereby alleviating the communication bottleneck.
- **Hybrid Parallelism**: Use a combination of data parallelism and model parallelism to train node embeddings, thereby independently scaling computational and storage capabilities and reducing communication and storage requirements.
- **Serverless Architecture**: With the help of the Ray framework, GraphScale achieves elastic resource allocation and fault - tolerance capabilities, simplifying resource management.
### Experimental Results
Experiments show that when processing graphs containing billions of nodes, GraphScale reduces the end - to - end training time by at least 40% compared to existing distributed training frameworks (such as DistDGL and GraphLearn), without a performance degradation. In addition, GraphScale has been deployed in TikTok's actual production environment and can efficiently handle large - scale graph data containing billions of nodes.
In conclusion, GraphScale significantly improves the scalability and efficiency of machine learning for large - scale graph data by optimizing feature acquisition, introducing hybrid parallelism, and adopting a serverless architecture.