Abstract: The speed of deep neural network training has become a major bottleneck in deep learning research and development. For example, training GoogleNet on the ImageNet dataset with one Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are mainly designed for cloud systems. These methods are asynchronous because of the slow networks and high fault-tolerance requirements of cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a round-robin method for communication and updating. Communication is ordered by machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use system-algorithm codesign techniques to scale up the algorithms. By reducing the percentage of communication from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
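
To make the elastic averaging idea concrete, the following is a minimal single-process sketch of one synchronous EASGD step, assuming the standard elastic averaging update rule (each worker is pulled toward a shared center weight, and the center moves toward the workers). The function name `sync_easgd_step`, its parameters, and the toy quadratic objective are hypothetical and chosen for illustration only; this is not the implementation evaluated in the abstract, and the summation over workers merely stands in for what would be a single collective reduction on a cluster.

```python
def sync_easgd_step(workers, center, grads, lr, rho):
    """One synchronous elastic-averaging step (illustrative sketch).

    workers: list of local weight vectors (lists of floats), one per worker
    center:  shared center weight vector (list of floats)
    grads:   one gradient vector per worker, evaluated at its local weights
    lr, rho: learning rate and elastic coefficient (assumed hyperparameters)
    """
    alpha = lr * rho  # strength of the elastic pull
    n = len(center)
    # Sum of elastic differences (x_i - center). All workers contribute in
    # the same step, instead of one worker at a time in rank order.
    diff_sum = [sum(w[j] - center[j] for w in workers) for j in range(n)]
    # Each worker takes a gradient step and is pulled toward the old center.
    new_workers = [
        [w[j] - lr * g[j] - alpha * (w[j] - center[j]) for j in range(n)]
        for w, g in zip(workers, grads)
    ]
    # The center moves toward the workers by the summed difference.
    new_center = [center[j] + alpha * diff_sum[j] for j in range(n)]
    return new_workers, new_center


# Toy usage: two workers minimizing f(x) = 0.5 * x**2, whose gradient is x.
workers, center = [[1.0], [3.0]], [0.0]
for _ in range(200):
    grads = [[w[0]] for w in workers]
    workers, center = sync_easgd_step(workers, center, grads, lr=0.1, rho=0.5)
```

Because every worker's contribution is combined in the same step, the result does not depend on message arrival order, which is consistent with the deterministic behavior the abstract attributes to Sync EASGD.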

RedSync: Reducing Synchronization Bandwidth for Distributed Deep Learning Training System

RedSync: Reducing Synchronization Traffic for Distributed Deep Learning

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Learned Gradient Compression for Distributed Deep Learning

A Generic, High-Performance, Compression-Aware Framework for Data Parallel DNN Training

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

Sparse Gradient Compression for Distributed SGD

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Scaling Deep Learning on GPU and Knights Landing clusters

An Efficient Bandwidth-Adaptive Gradient Compression Algorithm for Distributed Training of Deep Neural Networks

S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

A Quadratic Synchronization Rule for Distributed Deep Learning

Sparse Communication for Training Deep Networks

Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration

Compressed Communication for Distributed Training: Adaptive Methods and System

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

A Novel Adaptive Gradient Compression Scheme: Reducing the Communication Overhead for Distributed Deep Learning in the Internet of Things

Decentralized Deep Learning with Arbitrary Communication Compression