Abstract: The speed of deep neural network training has become a major bottleneck in deep learning research and development. For example, training GoogleNet on the ImageNet dataset with one Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirements of cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a round-robin method for communication and updating. The communication is ordered by machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use system-algorithm codesign techniques to scale up the algorithms. By reducing the share of time spent on communication from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
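
To make the elastic-averaging idea behind EASGD more concrete, the following is a minimal Python sketch of one synchronous update round, based on the standard EASGD update rule (each worker is pulled toward a shared center variable, and the center moves toward the workers). It is not the paper's implementation: the function names `sync_easgd_step` and `quadratic_grad`, the values of `eta` and `rho`, and the toy quadratic objective are all assumptions chosen for illustration, and everything runs in a single process with no real communication.

```python
# Hypothetical sketch of one synchronous elastic-averaging round.
# eta (learning rate) and rho (elastic coefficient) are illustrative values,
# not settings taken from the paper.

def sync_easgd_step(workers, center, grads, eta=0.1, rho=0.5):
    """Return updated (workers, center) after one synchronous round.

    workers: list of per-worker weight vectors (lists of floats)
    center:  shared center weight vector
    grads:   list of per-worker gradient vectors, same shapes as workers
    """
    dim = len(center)
    # Sum of elastic differences (x_i - center) over all workers. In this
    # sketch it is a plain loop; on a cluster it could plausibly be computed
    # with one collective reduction (an assumption, not the paper's code).
    diff_sum = [sum(w[d] - center[d] for w in workers) for d in range(dim)]
    # Worker update: gradient step plus an elastic pull toward the center.
    new_workers = [
        [w[d] - eta * (g[d] + rho * (w[d] - center[d])) for d in range(dim)]
        for w, g in zip(workers, grads)
    ]
    # Center update: move toward the workers using the old worker values.
    new_center = [center[d] + eta * rho * diff_sum[d] for d in range(dim)]
    return new_workers, new_center


def quadratic_grad(w):
    # Gradient of the toy objective f(w) = 0.5 * ||w||^2 (minimum at 0).
    return list(w)


if __name__ == "__main__":
    workers = [[1.0, -2.0], [3.0, 0.5]]
    center = [0.0, 0.0]
    for _ in range(200):
        grads = [quadratic_grad(w) for w in workers]
        workers, center = sync_easgd_step(workers, center, grads)
    print(workers, center)
```

Because every worker and the center are updated from the same snapshot of values, repeated runs with the same inputs give the same result, which mirrors the deterministic behavior the abstract attributes to a synchronous scheme.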

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

Energy-Efficient GPU Clusters Scheduling for Deep Learning

SCHED²: Scheduling Deep Learning Training Via Deep Reinforcement Learning

Dynamic Resource Allocation for Deep Learning Clusters with Separated Compute and Storage

GPU Cluster Scheduling for Network-Sensitive Deep Learning

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

Elastic Deep Learning in Multi-Tenant GPU Clusters

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Accelerating End-to-End Deep Learning Workflow With Codesign of Data Preprocessing and Scheduling

Deep Neural Network Hardware Deployment Optimization via Advanced Active Learning

DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment

Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters

GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads

Scaling Deep Learning on GPU and Knights Landing clusters

EdgeCI: Distributed Workload Assignment and Model Partitioning for CNN Inference on Edge Clusters

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

Energy-Aware Non-Preemptive Task Scheduling with Deadline Constraint in DVFS-Enabled Heterogeneous Clusters