Abstract: Supernet training, a prevalent and important paradigm in Neural Architecture Search, embeds the whole DNN architecture search space into one monolithic supernet, iteratively activates a subset of the supernet (i.e., a subnet) to fit each batch of data, and searches for a high-quality subnet that meets specific requirements. Although training subnets in parallel on multiple GPUs is desirable for acceleration, there is an inherent race hazard: concurrent subnets may access the same DNN layers. Existing systems neither parallelize subnets' training executions efficiently nor resolve the race hazard deterministically, leading to unreproducible training procedures and potentially non-trivial accuracy loss. We present NASPipe, the first high-performance and reproducible distributed supernet training system, built on a causal synchronous parallel (CSP) pipeline scheduling abstraction: NASPipe partitions a supernet across GPUs and concurrently executes multiple generated sub-tasks (subnets) in a pipelined manner; meanwhile, it tracks the correlations between the subnets and deterministically resolves any causal dependency caused by the subnets' layer sharing. To obtain high performance, NASPipe's CSP scheduler exploits the fact that the larger the supernet, the fewer dependencies manifest between chronologically close subnets; it therefore aggressively schedules subnets with later chronological orders into execution, provided that they are not causally dependent on unfinished preceding subnets. Moreover, to relieve the excessive GPU memory burden of holding the whole supernet's parameters, NASPipe uses a context switch technique that stashes the whole supernet in CPU memory, precisely predicts the subnets' schedule, and pre-fetches a subnet before its execution and evicts it afterwards.
The evaluation shows that NASPipe is the only system that retains supernet training reproducibility, while achieving comparable or even higher performance (up to 7.8X) compared to three recent pipeline training systems (e.g., GPipe).
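
The scheduling rule described in the abstract can be illustrated with a minimal sketch. This is not NASPipe's actual scheduler; it only assumes that each subnet is represented as a set of layer identifiers listed in chronological order, and that a subnet may start once it shares no layer with any unfinished earlier subnet. The function name `schedulable` and the layer names are hypothetical.

```python
from typing import List, Set


def schedulable(subnets: List[Set[str]], finished: Set[int], running: Set[int]) -> List[int]:
    """Return indices of subnets that may start now.

    Assumption: the list index is the subnet's chronological order, and a
    subnet is causally dependent on an earlier one if they share a layer.
    """
    ready = []
    for i, layers in enumerate(subnets):
        if i in finished or i in running:
            continue
        # Blocked if any earlier, still-unfinished subnet touches a shared layer.
        blocked = any(
            j not in finished and layers & subnets[j]
            for j in range(i)
        )
        if not blocked:
            ready.append(i)
    return ready


# Hypothetical subnets sampled from a supernet, in chronological order.
subnets = [
    {"conv3_a", "fc_a"},
    {"conv3_a", "fc_b"},   # shares conv3_a with subnet 0
    {"conv5_b", "fc_c"},   # independent of subnets 0 and 1
    {"conv5_b", "fc_a"},   # shares layers with subnets 0 and 2
]

first = schedulable(subnets, finished=set(), running={0})
second = schedulable(subnets, finished={0}, running={2})
```

In this example, subnet 2 can run ahead of subnet 1 while subnet 0 is still in flight, because it touches none of subnet 0's or subnet 1's layers; subnet 1 becomes eligible only after subnet 0 finishes. Because eligibility depends only on chronological order and layer sharing, the outcome is the same regardless of execution timing.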

Choice of Parallelism: Multi-GPU Driven Pipeline for Huge Academic Backbone Network

Efficient Modeling and Real-Time Rendering of Massive Urban Pipelines Based on GPU

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

Large Scale Multi-GPU Based Parallel Traffic Simulation for Accelerated Traffic Assignment and Propagation

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Apache Spark Accelerated Deep Learning Inference for Large Scale Satellite Image Analytics

Hybrid CPU-GPU Framework for Network Motifs

Analyzing the Performance of Graph Neural Networks with Pipe Parallelism

NASPipe: high performance and reproducible pipeline parallel supernet training via causal synchronous parallelism

High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms

GPUSCAN: GPU-Based Parallel Structural Clustering Algorithm for Networks

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

A Scalable Software Framework for Stateful Stream Data Processing on Multiple GPUs and Applications

High-Performance Massive Subgraph Counting Using Pipelined Adaptive-Group Communication

Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform

SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training