Abstract: Many distributed training techniques, such as Parameter Server and AllReduce, have been proposed to take advantage of increasingly large datasets and rich features. However, stragglers frequently occur in distributed training because of resource contention and hardware heterogeneity, and they significantly hamper training efficiency. Previous works address only some types of stragglers and cannot adapt to the variety of stragglers seen in practice. Moreover, handling all stragglers within one systematic framework is challenging because different stragglers require different data allocation and fault-tolerance mechanisms. This paper therefore proposes AntDT (Ant Distributed Training Framework), a unified distributed training framework that adaptively solves straggler problems. First, the framework consists of four components: the Stateful Dynamic Data Sharding service, Monitor, Controller, and Agent. These components work together to distribute workloads efficiently and provide a range of pre-defined straggler mitigation methods with fault tolerance, thereby hiding the messy details of data allocation and fault handling. Second, the framework is highly flexible, allowing straggler mitigation solutions to be customized to the specific circumstances of the cluster. Leveraging this flexibility, we introduce two straggler mitigation solutions as practical examples, AntDT-ND for non-dedicated clusters and AntDT-DD for dedicated clusters, which resolve various types of stragglers at Ant Group. Comprehensive experiments and industrial deployment statistics show that AntDT outperforms other state-of-the-art (SOTA) methods by more than 3x in training efficiency. Additionally, in Alipay's homepage recommendation scenario, AntDT reduces the training duration of the ranking model from 27.8 hours to just 5.4 hours.
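
To make the idea of dynamic data sharding with fault tolerance concrete, the following is a minimal sketch and not AntDT's actual implementation. It assumes a simple design in which workers pull fixed-size shards on demand, so faster workers naturally process more data, and shards held by a failed worker are put back in the queue. The names `ShardService`, `simulate_dynamic`, and `static_makespan`, as well as the worker speeds, are hypothetical and chosen only for illustration.

```python
import heapq
from collections import deque


class ShardService:
    """Hypothetical stand-in for a stateful dynamic data sharding service."""

    def __init__(self, num_samples, shard_size):
        # Split sample indices [0, num_samples) into shard_id -> (start, end).
        self.ranges = {}
        for shard_id, start in enumerate(range(0, num_samples, shard_size)):
            self.ranges[shard_id] = (start, min(start + shard_size, num_samples))
        self.todo = deque(self.ranges)  # shard ids not yet handed out
        self.doing = {}                 # shard id -> worker currently holding it
        self.done = set()               # shard ids acknowledged as finished

    def fetch(self, worker):
        """Hand the next pending shard to `worker`, or None if none is pending."""
        if not self.todo:
            return None
        shard_id = self.todo.popleft()
        self.doing[shard_id] = worker
        return (shard_id, *self.ranges[shard_id])

    def ack(self, shard_id):
        """Mark a shard as fully processed."""
        del self.doing[shard_id]
        self.done.add(shard_id)

    def report_failure(self, worker):
        """Re-queue every shard held by a failed worker; return their ids."""
        lost = [sid for sid, w in self.doing.items() if w == worker]
        for sid in lost:
            del self.doing[sid]
            self.todo.append(sid)
        return lost


def simulate_dynamic(speeds, num_samples, shard_size):
    """Simulate workers (name -> samples/sec) pulling shards on demand."""
    service = ShardService(num_samples, shard_size)
    processed = {w: 0 for w in speeds}
    events = [(0.0, w) for w in speeds]  # (time the worker becomes free, worker)
    heapq.heapify(events)
    makespan = 0.0
    while events:
        now, worker = heapq.heappop(events)
        shard = service.fetch(worker)
        if shard is None:
            continue  # no work left; this worker retires
        shard_id, start, end = shard
        finish = now + (end - start) / speeds[worker]
        service.ack(shard_id)  # the simulation treats the shard as done at `finish`
        processed[worker] += end - start
        makespan = max(makespan, finish)
        heapq.heappush(events, (finish, worker))
    return makespan, processed


def static_makespan(speeds, num_samples):
    """Completion time when every worker gets an equal, fixed share of the data."""
    share = num_samples / len(speeds)
    return max(share / s for s in speeds.values())


if __name__ == "__main__":
    speeds = {"fast_a": 100.0, "fast_b": 100.0, "slow": 25.0}  # assumed speeds
    makespan, processed = simulate_dynamic(speeds, num_samples=900, shard_size=50)
    print("dynamic:", makespan, processed)
    print("static :", static_makespan(speeds, num_samples=900))
```

In this toy setting the slow worker ends up with fewer samples than the fast ones, and the job finishes sooner than under an equal static split, where every worker must wait for the straggler. A real system would also need to persist shard state and detect failures, which this sketch leaves out.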

Identifying Performance Bottleneck in Shared In-Network Aggregation During Distributed Training

In-Network Aggregation with Transport Transparency for Distributed Training

Enabling Switch Memory Management for Distributed Training with In-Network Aggregation

Training Job Placement in Clusters with Statistical In-Network Aggregation

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

AggTree: A Routing Tree with In-Network Aggregation for Distributed Training

ATP: In-network Aggregation for Multi-tenant Learning

AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Rina: Enhancing Ring-AllReduce with In-network Aggregation in Distributed Model Training

Preemptive Switch Memory Usage to Accelerate Training Jobs with Shared In-Network Aggregation

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

Efficient Data-Plane Memory Scheduling for In-Network Aggregation

ARGO: An Auto-Tuning Runtime System for Scalable GNN Training on Multi-Core Processor

Proteus: Simulating the Performance of Distributed DNN Training

Is Network the Bottleneck of Distributed Training?

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

From promise to practice: realizing high-performance decentralized training

Rationing Bandwidth Resources for Mitigating Network Resource Contention in Distributed DNN Training Clusters