Abstract: The speed of deep neural network training has become a major bottleneck in deep learning research and development. For example, training GoogleNet on the ImageNet dataset with one Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are mainly designed for cloud systems. These methods are asynchronous because of the slow networks and high fault-tolerance requirements of cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a round-robin method for communication and updating. The communication is ordered by machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use several system-algorithm codesign techniques to scale up the algorithms. By reducing the share of communication from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
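To make the elastic averaging idea concrete, the following is a minimal single-process sketch of one synchronous elastic-averaging step, assuming the commonly cited EASGD update rule in which each worker is pulled toward a shared center variable and the center is pulled toward the workers. It is not the paper's implementation: the function names `sync_easgd_step` and `run_toy`, the quadratic toy objective, and the values of `lr` and `rho` are all illustrative assumptions, and a real HPC version would replace the Python `sum` with a collective reduction across nodes.

```python
import numpy as np

def sync_easgd_step(workers, center, grads, lr, rho):
    """One synchronous elastic-averaging step (illustrative sketch only).

    workers: list of per-worker weight vectors
    center:  shared center weight vector
    grads:   list of per-worker gradients, same order as workers
    lr, rho: learning rate and elastic coefficient (assumed hyperparameters)
    """
    # All elastic differences use the same snapshot of the center,
    # so the result does not depend on worker ordering (deterministic).
    diffs = [w - center for w in workers]
    new_workers = [w - lr * (g + rho * d)
                   for w, g, d in zip(workers, grads, diffs)]
    # In a distributed setting this sum would be a single reduction.
    new_center = center + lr * rho * sum(diffs)
    return new_workers, new_center

def run_toy(steps=300, n_workers=4, lr=0.1, rho=0.5, seed=0):
    """Minimize 0.5 * ||x - target||^2 with the step above (toy problem)."""
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -2.0, 3.0])
    workers = [rng.normal(size=3) for _ in range(n_workers)]
    center = np.zeros(3)
    for _ in range(steps):
        grads = [w - target for w in workers]  # exact gradient of the toy loss
        workers, center = sync_easgd_step(workers, center, grads, lr, rho)
    return workers, center, target

if __name__ == "__main__":
    workers, center, target = run_toy()
    print("center:", center)
    print("target:", target)
```

Because every worker reads the same center snapshot and the center update is a single sum, the step is independent of worker order, which is one way to read the abstract's claim that the synchronous variant is deterministic.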

Round-Robin Synchronization: Mitigating Communication Bottlenecks in Parameter Servers

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

Petrel: Community-Aware Synchronous Parallel For Heterogeneous Parameter Server

FluentPS: A Parameter Server Design with Low-frequency Synchronization for Distributed Deep Learning

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Fast Distributed Deep Learning Via Worker-adaptive Batch Sizing

HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning

Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization

Learning Efficient Parameter Server Synchronization Policies for Distributed SGD

Falcon: Towards Computation-Parallel Deep Learning in Heterogeneous Parameter Server

Falcon: Addressing Stragglers in Heterogeneous Parameter Server Via Multiple Parallelism

DRPS: Efficient Disk-Resident Parameter Servers for Distributed Machine Learning

Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning

Hybrid Parameter Update: Alleviating Imbalance Impacts for Distributed Deep Learning

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

Semi-Dynamic Load Balancing: Efficient Distributed Learning in Non-Dedicated Environments

Scaling Deep Learning on GPU and Knights Landing clusters

SSD-SSD: Communication sparsification for distributed deep learning training

PSO-PS: Parameter Synchronization with Particle Swarm Optimization for Distributed Training of Deep Neural Networks

Rationing Bandwidth Resources for Mitigating Network Resource Contention in Distributed DNN Training Clusters

FSP: Towards Flexible Synchronous Parallel Frameworks for Distributed Machine Learning