Abstract: Deep learning (DL) has achieved notable successes in many machine learning tasks. A number of frameworks have been developed to expedite the process of designing and training deep neural networks (DNNs), such as Caffe, Torch and Theano. Currently they can harness multiple GPUs on a single machine, but are unable to use GPUs that are distributed across multiple machines. Because even average-sized DNNs can take days to train on a single GPU with 100s of GBs to TBs of data, distributed GPUs present a prime opportunity for scaling up DL. However, the limited bandwidth available on commodity Ethernet networks is a bottleneck for distributed GPU training and prevents its trivial realization. To investigate how to adapt existing frameworks to efficiently support distributed GPUs, we propose Poseidon, a scalable system architecture for distributed inter-machine communication in existing DL frameworks. We integrate Poseidon with Caffe and evaluate its performance at training DNNs for object recognition. Poseidon features three key contributions that accelerate DNN training on clusters: (1) a three-level hybrid architecture that allows Poseidon to support both CPU-only and GPU-equipped clusters, (2) a distributed wait-free backpropagation (DWBP) algorithm to improve GPU utilization and to balance communication, and (3) a structure-aware communication protocol (SACP) to minimize communication overheads. We empirically show that Poseidon converges to the same objectives as a single machine, and achieves state-of-the-art training speedup across multiple models and well-established datasets using a commodity GPU cluster of 8 nodes (e.g., 4.5x speedup on AlexNet, 4x on GoogLeNet, 4x on CIFAR-10). On the much larger ImageNet22K dataset, Poseidon with 8 nodes achieves better speedup than, and accuracy competitive with, recent CPU-based distributed systems such as Adam and Le et al., which use 10s to 1000s of nodes.
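The abstract does not spell out how SACP reduces communication, so the following is only a minimal sketch of one idea a structure-aware protocol could exploit, not Poseidon's actual implementation. It assumes that the gradient of a fully connected layer with an M x N weight matrix is a sum of one outer product per training sample, so a worker could send the per-sample vectors ("sufficient factors") instead of the full matrix. The function names and the cost model (a simple count of floats per worker, ignoring the number of workers and the network topology) are hypothetical.

```python
def full_matrix_floats(m, n):
    """Floats sent if a worker ships the whole M x N gradient matrix."""
    return m * n


def sufficient_factor_floats(m, n, batch_size):
    """Floats sent if a worker ships one (u, v) vector pair per sample."""
    return batch_size * (m + n)


def cheaper_scheme(m, n, batch_size):
    """Pick the scheme with the smaller payload (hypothetical cost model)."""
    if sufficient_factor_floats(m, n, batch_size) < full_matrix_floats(m, n):
        return "sufficient_factors"
    return "full_matrix"


def gradient_from_factors(us, vs):
    """Rebuild the M x N gradient as the sum of outer products u_k v_k^T."""
    m, n = len(us[0]), len(vs[0])
    grad = [[0.0] * n for _ in range(m)]
    for u, v in zip(us, vs):
        for i in range(m):
            for j in range(n):
                grad[i][j] += u[i] * v[j]
    return grad


if __name__ == "__main__":
    # A large fully connected layer with a modest batch favors the factors.
    print(cheaper_scheme(4096, 4096, 256))
    # A tiny layer is cheaper to send as a full matrix.
    print(cheaper_scheme(10, 10, 256))
```

Under this assumed model, a receiver that gets the factors can reconstruct the same gradient with `gradient_from_factors`, so the choice affects only the bytes on the wire, not the update that is applied.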

Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

HetHub: A Heterogeneous Distributed Hybrid Training System for Large-Scale Models

LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Scaling Deep Learning on GPU and Knights Landing clusters

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

ZeRO-Offload: Democratizing Billion-Scale Model Training

EdgeSP: Scalable Multi-device Parallel DNN Inference on Heterogeneous Edge Clusters

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices