BLAD: Adaptive Load Balanced Scheduling and Operator Overlap Pipeline for Accelerating the Dynamic GNN Training

Kaihua Fu, Quan Chen, Yuzhuo Yang, Jiuchen Shi, Chao Li, Minyi Guo
DOI: https://doi.org/10.1145/3581784.3607040
2023-01-01
Abstract: Dynamic graph networks are widely used for learning time-evolving graphs, but prior work on training these networks is inefficient due to communication overhead, long synchronization, and poor resource usage. Our investigation shows that communication and synchronization can be reduced by carefully scheduling the workload, and that the execution order of operators in GNNs can be adjusted without hurting training convergence. We propose BLAD, a system that takes these factors into account, comprising a two-level load scheduler and an overlap-aware topology manager. The scheduler allocates each snapshot group to a GPU, alleviating cross-GPU communication. The snapshots in a group are then carefully allocated to processes on a GPU, enabling overlap of compute-intensive NN operators and memory-intensive graph operators. The topology manager adjusts the operators' execution order to maximize the overlap. Experiments show that BLAD achieves a 27.2% average speedup in training time over state-of-the-art solutions without affecting final accuracy.
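The first scheduling level described in the abstract, allocating snapshot groups to GPUs so that load stays balanced, can be illustrated with a minimal sketch. This is not BLAD's actual scheduler: it swaps in a generic greedy longest-processing-time heuristic, and the function name `assign_groups_to_gpus`, the group identifiers, and the per-group cost estimates are all hypothetical values chosen for illustration.

```python
import heapq


def assign_groups_to_gpus(group_costs, num_gpus):
    """Greedy longest-processing-time assignment of snapshot groups to GPUs.

    Illustrative only: a generic load-balancing heuristic, not the
    scheduler described in the paper. `group_costs` maps a hypothetical
    group id to an assumed cost estimate for training that group.
    """
    # Min-heap of (accumulated load, gpu id); the lightest GPU is on top.
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    assignment = {gpu: [] for gpu in range(num_gpus)}

    # Place the most expensive groups first, each on the lightest GPU.
    ordered = sorted(group_costs.items(), key=lambda kv: kv[1], reverse=True)
    for group_id, cost in ordered:
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(group_id)
        heapq.heappush(heap, (load + cost, gpu))
    return assignment


# Hypothetical cost estimates for four snapshot groups.
group_costs = {"g0": 8.0, "g1": 5.0, "g2": 4.0, "g3": 3.0}
assignment = assign_groups_to_gpus(group_costs, num_gpus=2)

# Total estimated load per GPU under this assignment.
loads = {
    gpu: sum(group_costs[g] for g in groups)
    for gpu, groups in assignment.items()
}
print(assignment)
print(loads)
```

Keeping each snapshot group whole on one GPU is what avoids cross-GPU communication in this sketch; the heuristic only decides which GPU receives which group.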