Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training

Borui Wan, Juntao Zhao, Chuan Wu
2023-06-02
Abstract: Distributed full-graph training of Graph Neural Networks (GNNs) over large graphs is bandwidth-demanding and time-consuming. Frequent exchanges of node features, embeddings and embedding gradients (all referred to as messages) across devices bring significant communication overhead for nodes with remote neighbors on other devices (marginal nodes) and unnecessary waiting time for nodes without remote neighbors (central nodes) in the training graph. This paper proposes an efficient GNN training system, AdaQP, to expedite distributed full-graph GNN training. We stochastically quantize messages transferred across devices to lower-precision integers for communication traffic reduction and advocate communication-computation parallelization between marginal nodes and central nodes. We provide theoretical analysis to prove fast training convergence (at the rate of O(T^{-1}) with T being the total number of training epochs) and design an adaptive quantization bit-width assignment scheme for each message based on the analysis, targeting a good trade-off between training convergence and efficiency. Extensive experiments on mainstream graph datasets show that AdaQP substantially improves distributed full-graph training's throughput (up to 3.01×) with negligible accuracy drop (at most 0.30%) or even accuracy improvement (up to 0.19%) in most cases, showing significant advantages over the state-of-the-art works.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
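The stochastic quantization step described in the abstract can be illustrated with a small sketch. This is not AdaQP's implementation: it assumes each message is a flat list of floats that is scaled by its own minimum and maximum onto an unsigned b-bit integer grid, with stochastic rounding so that the dequantized value is unbiased in expectation. The helper names `quantize` and `dequantize` are hypothetical.

```python
import random


def quantize(values, bits, rng=random):
    """Stochastically round floats to unsigned `bits`-bit integer codes.

    Hypothetical helper: uses per-message min-max scaling. Returns the codes
    together with the offset and step size needed to dequantize them.
    """
    levels = (1 << bits) - 1                       # largest code, e.g. 255 for 8 bits
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels if hi > lo else 1.0  # grid spacing in value units
    codes = []
    for v in values:
        pos = (v - lo) / step                      # real-valued position in [0, levels]
        base = int(pos)                            # pos >= 0, so int() acts as floor
        frac = pos - base
        # Round up with probability `frac`, so E[code] == pos (unbiased).
        code = base + (1 if rng.random() < frac else 0)
        codes.append(min(code, levels))            # guard against float overshoot
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map integer codes back to approximate float values."""
    return [lo + c * step for c in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.0, 0.3, 0.71, 1.0]
    codes, lo, step = quantize(x, bits=2, rng=rng)
    print(codes, dequantize(codes, lo, step))
```

Because each value rounds up with probability equal to its fractional grid position, the expected dequantized value equals the original. The per-value variance is at most one quarter of the squared step size, where the step is (max − min)/(2^b − 1), so fewer bits mean less traffic but noisier messages.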
What problem does this paper attempt to address?
The paper addresses the high bandwidth demand and long training time of distributed full-graph Graph Neural Network (GNN) training on large graphs. Specifically, it targets the communication overhead caused by frequently exchanging node features, embeddings, and embedding gradients (collectively called messages) across devices during training. For marginal nodes (nodes with neighbors on other devices), these exchanges incur significant communication overhead; for central nodes (nodes whose neighbors all reside on the local device), they cause unnecessary waiting, since the computation of central nodes is held up by the exchange even though it needs no remote data.

To solve these problems, the authors propose AdaQP, an efficient GNN training system that accelerates distributed full-graph GNN training in two main ways:

1. **Adaptive Quantization of Messages**: stochastically quantizing the messages transferred across devices to lower-precision integers, which reduces communication traffic.
2. **Parallelization of Computation and Communication**: on each device, overlapping the computation of central nodes with the message transmission of marginal nodes, which improves training speed and resource utilization.

The paper also provides a theoretical analysis showing that AdaQP converges fast, at a rate of O(T^{-1}) where T is the total number of training epochs, and uses this analysis to design an adaptive scheme that assigns a quantization bit-width to each message, trading off training convergence against efficiency. Experiments on mainstream graph datasets show that AdaQP significantly reduces communication time and improves training throughput by up to 3.01×, with negligible accuracy drop (at most 0.30%) or even accuracy improvement (up to 0.19%) in most cases.
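The marginal/central split and the communication-computation overlap can be sketched on a toy graph. This is a hypothetical illustration, not AdaQP's code: it assumes an undirected edge list, a node-to-device `owner` map, and mean aggregation as a stand-in for one GNN layer. A single background thread running a caller-supplied `fetch_remote` callable (a hypothetical name) stands in for the real cross-device exchange of messages, which in practice would be asynchronous collective communication on the accelerators rather than a Python thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def split_nodes(local_nodes, edges, owner, me):
    """Classify this device's nodes as marginal (has a remote neighbor) or central."""
    marginal = set()
    for u, v in edges:  # undirected edge list
        if owner[u] == me and owner[v] != me:
            marginal.add(u)
        if owner[v] == me and owner[u] != me:
            marginal.add(v)
    central = [n for n in local_nodes if n not in marginal]
    return sorted(marginal), central


def aggregate(nodes, neighbors, feats):
    """Mean-aggregate neighbor features (a stand-in for one GNN layer)."""
    return {n: sum(feats[m] for m in neighbors[n]) / len(neighbors[n]) for n in nodes}


def layer_forward(local_feats, neighbors, marginal, central, fetch_remote):
    """Overlap the remote-message exchange with central-node computation."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_remote)                # 1) start exchange in background
        out = aggregate(central, neighbors, local_feats)   # 2) central nodes: local data only
        remote = pending.result()                          # 3) wait for remote messages
        out.update(aggregate(marginal, neighbors, {**local_feats, **remote}))
    return out


if __name__ == "__main__":
    def fake_fetch():
        time.sleep(0.05)   # hypothetical stand-in for network latency
        return {3: 5.0}    # feature of remote node 3, owned by device 1

    owner = {0: 0, 1: 0, 2: 0, 3: 1}
    edges = [(0, 1), (1, 2), (2, 3)]
    marginal, central = split_nodes([0, 1, 2], edges, owner, me=0)
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3]}
    feats = {0: 1.0, 1: 2.0, 2: 3.0}
    print(marginal, central, layer_forward(feats, neighbors, marginal, central, fake_fetch))
```

Here node 2 is the only marginal node on device 0, so nodes 0 and 1 are aggregated while the simulated exchange is still in flight; only node 2 has to wait for the remote feature.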
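The paper derives its per-message bit-width assignment from its convergence analysis; that exact formulation is not reproduced here. As a loose, hypothetical stand-in, the sketch below bounds each message's quantization variance by n·Δ²/4 (n values, Δ the grid step from min-max scaling with b bits) and greedily upgrades the message with the largest variance reduction per extra transmitted bit until a communication budget runs out. The function names, the candidate bit-widths (2, 4, 8), and the greedy rule are all assumptions chosen only to show the convergence-versus-traffic trade-off.

```python
def variance_bound(n, value_range, bits):
    """Upper bound on the summed stochastic-rounding variance of n values."""
    step = value_range / ((1 << bits) - 1)
    return n * step * step / 4


def assign_bits(messages, budget_bits, choices=(2, 4, 8)):
    """Hypothetical greedy bit-width assignment under a total traffic budget.

    `messages` maps a message name to (num_values, value_range). Every message
    starts at the lowest bit-width; we then repeatedly apply the single upgrade
    with the best variance reduction per extra bit that still fits the budget.
    """
    bits = {name: choices[0] for name in messages}
    used = sum(n * choices[0] for n, _ in messages.values())
    while True:
        best, best_gain = None, 0.0
        for name, (n, span) in messages.items():
            idx = choices.index(bits[name])
            if idx + 1 == len(choices):
                continue  # already at the highest bit-width
            nxt = choices[idx + 1]
            extra = n * (nxt - bits[name])
            if used + extra > budget_bits:
                continue  # upgrade would exceed the budget
            gain = (variance_bound(n, span, bits[name]) - variance_bound(n, span, nxt)) / extra
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            return bits  # no affordable upgrade reduces variance
        n = messages[best][0]
        nxt = choices[choices.index(bits[best]) + 1]
        used += n * (nxt - bits[best])
        bits[best] = nxt


if __name__ == "__main__":
    msgs = {"a": (100, 10.0), "b": (100, 0.1)}
    print(assign_bits(msgs, budget_bits=600))
```

Messages with a wide value range gain the most from extra bits, so they are upgraded first; a message whose values are all equal has zero variance at any bit-width and is never upgraded.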