T-GCN: A Sampling Based Streaming Graph Neural Network System with Hybrid Architecture.

Chengying Huan,Shuaiwen Leon Song,Yongchao Liu,Heng Zhang,Hang Liu,Charles He,Kang Chen,Jinlei Jiang,Yongwei Wu
DOI: https://doi.org/10.1145/3559009.3569648
2022-01-01
Abstract:As many real-world applications are streaming and attached with time instances, a few works have been proposed to learn streaming graph neural networks (GNNs). Unfortunately, current streaming GNNs are observed to have a large training overhead and suffer from bad parallel scalability on multiple GPUs. These drawbacks pose severe challenges to online learning of streaming GNNs and their application to real-time scenarios. To improve training efficiency, one promising solution is to use sampling, a technique widely used in static GNNs. However, to the best of our knowledge, sampling has not been investigated in learning streaming GNNs. Based on these observations, in this paper, we propose T-GCN, the first sampling-based streaming GNN system, which targets temporal-aware streaming graphs and takes advantage of a hybrid CPU-GPU co-processing architecture to achieve high throughput and low latency. T-GCN proposes an efficient sampling method, namely Segment Its Search , to offer high sampling speed with respect to three typical types of general graph sampling methods (i.e., node-wise, layer-wise, and subgraph sampling). We propose a locality-aware data partitioning method to reduce CPU-GPU communication latency and data transfer overhead, and an NVLink-specific task schedule to fully exploit NVLink's fast speed and improve GPU-GPU communication efficiency. Besides, we further pipeline the computation and the communication by introducing an efficient memory management mechanism, to improve scalability while hiding data communication. Overall, with respect to end-to-end performance, for single-GPU training, T-GCN achieves up to 7.9× speedup than state-of-the-art works. In terms of scalability, T-GCN runs 5.2× faster on average with 8 GPUs than one GPU. Additionally, in terms of sampling, T-GCN also yields a maximum of 38.8× speedup with our Segment Its Search sampling method.
What problem does this paper attempt to address?