Attribute-driven streaming edge partitioning with reconciliations for distributed graph neural network training

Zongshen Mu, Siliang Tang, Yueting Zhuang, Dianhai Yu
DOI: https://doi.org/10.1016/j.neunet.2023.06.026
Abstract: Current distributed graph training frameworks evenly partition a large graph into small chunks to suit distributed storage, leverage a uniform interface to access neighbors, and train graph neural networks on a cluster of machines to update weights. Nevertheless, they design storage and training separately, resulting in huge communication costs for retrieving neighborhoods. During the storage phase, traditional heuristic graph partitioning not only suffers from memory overhead because it loads the full graph into memory but also damages semantically related structures because it neglects meaningful node attributes. Moreover, in the weight-update phase, synchronization by direct averaging copes poorly with heterogeneous local models, where each machine's data are loaded from different subgraphs, resulting in slow convergence. To solve these problems, we propose a novel distributed graph training approach, attribute-driven streaming edge partitioning with reconciliations (ASEPR), where each local model loads only the subgraph stored on its own machine to reduce communication. ASEPR first clusters nodes with similar attributes into the same partition to maintain semantic structure and preserve multihop neighbor locality. Then streaming partitioning combined with attribute clustering is applied to subgraph assignment to alleviate memory overhead. After local graph neural network training on distributed machines, we deploy cross-layer reconciliation strategies for heterogeneous local models to improve the averaged global model through knowledge distillation and contrastive learning. Extensive experiments conducted on four large graph datasets on node classification and link prediction tasks show that our model outperforms DistDGL, with fewer resource requirements and up to four times the convergence speed.
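The abstract does not give ASEPR's partitioning rule, so the following is only a minimal sketch of the general idea of combining one-pass (streaming) edge assignment with attribute clusters. It is not the authors' algorithm. The function name `stream_partition`, the scoring terms, the two weights, and the convention that attribute-cluster ids coincide with partition ids are all assumptions made for illustration; the cluster ids are taken as precomputed (for example, by k-means on node attributes).

```python
def stream_partition(edges, node_cluster, num_parts,
                     attr_weight=1.0, balance_weight=1.0):
    """Greedy one-pass edge partitioning (illustrative; not the paper's rule).

    edges: iterable of (u, v) pairs, consumed one at a time.
    node_cluster: dict mapping node -> attribute-cluster id in [0, num_parts).
    Returns (per-edge partition ids, per-partition sets of replicated nodes).
    """
    loads = [0] * num_parts                      # edges held by each partition
    replicas = [set() for _ in range(num_parts)]  # nodes present per partition
    assignment = []
    for u, v in edges:
        max_load, min_load = max(loads), min(loads)
        best_p, best_score = 0, float("-inf")
        for p in range(num_parts):
            # Locality: prefer partitions that already hold an endpoint.
            locality = (u in replicas[p]) + (v in replicas[p])
            # Attribute affinity (assumed rule): prefer the partition whose
            # id matches an endpoint's attribute cluster.
            affinity = (node_cluster[u] == p) + (node_cluster[v] == p)
            # Balance: prefer lightly loaded partitions; value lies in [0, 1).
            balance = (max_load - loads[p]) / (1 + max_load - min_load)
            score = locality + attr_weight * affinity + balance_weight * balance
            if score > best_score:               # ties keep the lowest id
                best_p, best_score = p, score
        loads[best_p] += 1
        replicas[best_p].update((u, v))
        assignment.append(best_p)
    return assignment, replicas


# Toy graph: nodes 0-2 share one attribute cluster, nodes 3-5 another.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)]
node_cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
assignment, replicas = stream_partition(edges, node_cluster, num_parts=2)
```

Because each edge is scored against only the running partition state, the full graph never has to be held in memory at once, which is the property the abstract attributes to streaming partitioning. In this toy run, only the cross-cluster edge `(2, 3)` forces a node to be replicated in both partitions.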