Abstract: Deep neural networks (DNNs) have attracted tremendous attention as compelling solutions for applications such as image classification, object detection, and speech recognition. Their success depends on extensive training to ensure that model accuracy is good enough for those applications. Training a DNN model has become challenging because 1) model size and data size keep increasing, which usually requires more training iterations; and 2) DNN algorithms evolve rapidly, which requires a short training phase for quick deployment. To address these challenges, distributed training platforms have been proposed that leverage large numbers of server nodes for training, in the hope of significantly reducing training time. Scalability is therefore a critical performance metric for evaluating a distributed training platform. Nevertheless, our analysis reveals that traditional server clusters scale poorly for training because of traffic congestion within the server and beyond. Intra-server traffic on the I/O fabric can cause severe congestion and skewed quality of service as high-performance devices compete with each other. Moreover, traffic congestion on the Ethernet used for inter-server communication can also incur significant performance degradation. In this work, we devise a novel distributed training platform, EFLOPS, that adopts an algorithm and system co-design methodology to achieve good scalability. A new server architecture is proposed to alleviate intra-server congestion. Moreover, a new network topology, BiGraph, is proposed that divides the network into two separate parts, so that there is always a direct connection between any two nodes from different parts. Finally, to accompany BiGraph, a topology-aware allreduce algorithm is proposed to eliminate traffic congestion on the direct connections.
The experimental results show that eliminating congestion on the network interface yields up to an 11.3x communication speedup. The proposed algorithm and topology provide a further improvement of up to 6.08x. Overall, ResNet-50 training achieves near-linear scalability and is competitive with the top-ranking MLPerf results.

MalleTrain: Deep Neural Network Training on Unfillable Supercomputer Nodes

BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes

Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs over Heterogeneous Infrastructure

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling

Optimizing Makespan and Resource Utilization for Multi-DNN Training in GPU Cluster

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

HierTrain: Fast Hierarchical Edge AI Learning with Hybrid Parallelism in Mobile-Edge-Cloud Computing

Scaling the Training of Recurrent Neural Networks on Sunway TaihuLight Supercomputer

Deploying and Scaling Distributed Parallel Deep Neural Networks on the Tianhe-3 Prototype System

Scalable Resource Management for Dynamic MEC: An Unsupervised Link-Output Graph Neural Network Approach

Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization

nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training

MLPs: Efficient Training of MiniGo on Large-scale Heterogeneous Computing System

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration