Abstract: Many real-world applications produce streaming data associated with time instances, and several works have proposed streaming graph neural networks (GNNs) to learn from such data. Unfortunately, current streaming GNNs incur a large training overhead and scale poorly across multiple GPUs. These drawbacks pose severe challenges to online learning of streaming GNNs and to their use in real-time scenarios. One promising way to improve training efficiency is sampling, a technique widely used in static GNNs. However, to the best of our knowledge, sampling has not been investigated for learning streaming GNNs. Based on these observations, we propose T-GCN, the first sampling-based streaming GNN system, which targets temporal-aware streaming graphs and uses a hybrid CPU-GPU co-processing architecture to achieve high throughput and low latency. T-GCN introduces an efficient sampling method, namely Segment Its Search, that offers high sampling speed for three typical types of general graph sampling methods (i.e., node-wise, layer-wise, and subgraph sampling). We propose a locality-aware data partitioning method to reduce CPU-GPU communication latency and data transfer overhead, and an NVLink-specific task schedule to fully exploit NVLink's speed and improve GPU-GPU communication efficiency. We further pipeline computation and communication by introducing an efficient memory management mechanism, which improves scalability while hiding data communication. In terms of end-to-end performance, T-GCN achieves up to 7.9× speedup over state-of-the-art works for single-GPU training. In terms of scalability, T-GCN runs 5.2× faster on average with 8 GPUs than with one GPU. In terms of sampling, T-GCN achieves a maximum speedup of 38.8× with our Segment Its Search sampling method.

Scalable Graph Sampling on GPUs with Compressed Graph.

Efficient Data Loader for Fast Sampling-Based GNN Training on Large Graphs.

Accurate, Efficient and Scalable Graph Embedding

Optimizing GPU-based Graph Sampling and Random Walk for Efficiency and Scalability

PaGraph

DyGA: A Hardware-Efficient Accelerator with Traffic-Aware Dynamic Scheduling for Graph Convolutional Networks.

FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale

Adaptive Sampling Towards Fast Graph Representation Learning

GRAPHIC: GatheR-And-Process in Highly Parallel with In-SSD Compression Architecture in Very Large-Scale Graph

T-GCN: A Sampling Based Streaming Graph Neural Network System with Hybrid Architecture.

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

SGSI – A Scalable GPU-Friendly Subgraph Isomorphism Algorithm

Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication.

Connectivity-Based Segmentation for GPU-Accelerated Mesh Decompression

HyTGraph: GPU-Accelerated Graph Processing with Hybrid Transfer Management

Graph-based Scalable Sampling of 3D Point Cloud Attributes

GPUSCAN$^{++}$: Efficient Structural Graph Clustering on GPUs

CuLDA_CGS: Solving Large-scale LDA Problems on GPUs

Empirical analysis of performance bottlenecks in graph neural network training and inference with GPUs

Graph Sampling with Fast Random Walker on HBM-enabled FPGA Accelerators.

Learning by Sampling and Compressing: Efficient Graph Representation Learning with Extremely Limited Annotations