Abstract: Graph neural networks (GNNs) are among the most powerful tools in deep learning. They routinely solve complex problems on unstructured networks, such as node classification, graph classification, or link prediction, with high accuracy. However, both inference and training of GNNs are complex: they uniquely combine the features of irregular graph processing with dense and regular computations. This complexity makes it very challenging to execute GNNs efficiently on modern massively parallel architectures. To alleviate this, we first design a taxonomy of parallelism in GNNs, considering data and model parallelism, and different forms of pipelining. Then, we use this taxonomy to investigate the amount of parallelism in numerous GNN models, GNN-driven machine learning tasks, software frameworks, and hardware accelerators. We use the work-depth model, and we also assess communication volume and synchronization. We specifically focus on the sparsity/density of the associated tensors, in order to understand how to effectively apply techniques such as vectorization. We also formally analyze GNN pipelining, and we generalize the established Message-Passing class of GNN models to cover arbitrary pipeline depths, facilitating future optimizations. Finally, we investigate different forms of asynchronicity, charting a path toward future asynchronous parallel GNN pipelines. The outcomes of our analysis are synthesized in a set of insights that help to maximize GNN performance, and in a comprehensive list of challenges and opportunities for further research into efficient GNN computations. Our work will help to advance the design of future GNNs.

Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis

Beyond Data and Model Parallelism for Deep Neural Networks

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models

TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators

Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training

Optimal distributed parallel algorithms for deep learning framework Tensorflow