Abstract: Deep neural networks (DNNs) have attracted tremendous attention as compelling solutions for applications such as image classification, object detection, and speech recognition. Their success depends on extensive training to ensure that model accuracy is good enough for those applications. Training a DNN model has become challenging because 1) model and data sizes keep increasing, which usually requires more training iterations; and 2) DNN algorithms evolve rapidly, which requires the training phase to be short for quick deployment. To address these challenges, distributed training platforms have been proposed that use large numbers of server nodes for training, in the hope of significantly reducing training time. Scalability is therefore a critical performance metric for evaluating a distributed training platform. Nevertheless, our analysis reveals that traditional server clusters scale poorly for training because of traffic congestion within the server and beyond. Intra-server traffic on the I/O fabric can cause severe congestion and skewed quality of service as high-performance devices compete with each other. Moreover, traffic congestion on the Ethernet used for inter-server communication can also incur significant performance degradation. In this work, we devise a novel distributed training platform, EFLOPS, that adopts an algorithm and system co-design methodology to achieve good scalability. A new server architecture is proposed to alleviate intra-server congestion. Moreover, a new network topology, BiGraph, is proposed that divides the network into two separate parts, so that there is always a direct connection between any two nodes from different parts. Finally, to accompany BiGraph, a topology-aware allreduce algorithm is proposed to eliminate traffic congestion on the direct connections.
The experimental results show that eliminating congestion on the network interface can yield up to an 11.3x communication speedup. The proposed algorithm and topology provide a further improvement of up to 6.08x. Overall, ResNet-50 training achieves near-linear scalability and is competitive with the top-ranked MLPerf results.
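
The connectivity property stated for BiGraph (a direct connection between any two nodes from different parts) can be illustrated with a minimal sketch. It assumes the topology can be modeled as a complete bipartite graph over two groups of nodes; the names `build_bigraph`, `directly_connected`, `upper`, and `lower` are hypothetical and chosen only for illustration. It does not reproduce the EFLOPS implementation or its allreduce algorithm.

```python
from itertools import product


def build_bigraph(upper, lower):
    """Return the link set of a complete bipartite graph.

    Assumption: every node in `upper` is linked to every node in `lower`,
    and there are no links inside either group.
    """
    return set(product(upper, lower))


def directly_connected(links, a, b):
    """True if nodes a and b share a link, in either orientation."""
    return (a, b) in links or (b, a) in links


# Hypothetical node names for the two parts of the network.
upper = ["U0", "U1"]
lower = ["L0", "L1", "L2"]
links = build_bigraph(upper, lower)

# Every cross-part pair has a direct link; same-part pairs do not.
cross_part_ok = all(directly_connected(links, u, l) for u in upper for l in lower)
same_part_linked = directly_connected(links, "U0", "U1")
```

Under this assumption, traffic between nodes in different parts needs only one hop, while nodes in the same part must go through a node in the other part.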

Design and Performance Modeling of A YARN-based GPU Resource Scheduling System

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

A Virtual Multi-Channel GPU Fair Scheduling Method for Virtual Machines

Energy-Efficient GPU Clusters Scheduling for Deep Learning

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters

Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform

GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Kale: Elastic GPU Scheduling for Online DL Model Training

GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing