Abstract: Many distributed training techniques, such as Parameter Server and AllReduce, have been proposed to take advantage of increasingly large datasets and rich features. However, stragglers frequently occur in distributed training due to resource contention and hardware heterogeneity, which significantly hampers training efficiency. Previous work addresses only some types of stragglers and cannot adapt to the variety of stragglers seen in practice. Additionally, it is challenging to address all stragglers within one systematic framework, because different stragglers require different data allocation and fault-tolerance mechanisms. Therefore, this paper proposes a unified distributed training framework called AntDT (Ant Distributed Training Framework) to adaptively solve the straggler problem. First, the framework consists of four components: the Stateful Dynamic Data Sharding service, the Monitor, the Controller, and the Agent. These components work together to distribute workloads efficiently and to provide a range of pre-defined straggler mitigation methods with fault tolerance, thereby hiding the messy details of data allocation and fault handling. Second, the framework is highly flexible, allowing straggler mitigation solutions to be customized to the specific circumstances of the cluster. Leveraging this flexibility, we introduce two straggler mitigation solutions, AntDT-ND for non-dedicated clusters and AntDT-DD for dedicated clusters, as practical examples that resolve various types of stragglers at Ant Group. In our comprehensive experiments and industrial deployment statistics, AntDT outperforms other state-of-the-art methods by more than 3x in training efficiency. Additionally, in Alipay's homepage recommendation scenario, AntDT reduces the training duration of the ranking model from 27.8 hours to just 5.4 hours.
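
To make the idea of dynamic data sharding concrete, the following is a minimal sketch of a pull-based shard queue, not AntDT's actual implementation. The abstract does not specify the service's interface, so every name here (`ShardQueue`, `fetch`, `report_done`, `recover`, `simulate`) is a hypothetical chosen for illustration. The sketch assumes that workers pull fixed-size shards on demand, so a slow worker naturally processes fewer shards, and that the service tracks in-flight shards so they can be re-queued if a worker fails.

```python
from collections import deque


class ShardQueue:
    """Hypothetical stateful shard service: tracks todo, in-flight, and done shards."""

    def __init__(self, num_samples, shard_size):
        # Each shard is a half-open (start, end) range of sample indices.
        self.todo = deque(
            (start, min(start + shard_size, num_samples))
            for start in range(0, num_samples, shard_size)
        )
        self.doing = {}  # shard -> id of the worker currently holding it
        self.done = []

    def fetch(self, worker):
        """Hand the next pending shard to a worker, or None if none remain."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[shard] = worker
        return shard

    def report_done(self, shard):
        """Mark an in-flight shard as finished."""
        del self.doing[shard]
        self.done.append(shard)

    def recover(self, worker):
        """Re-queue every shard held by a failed worker; return those shards."""
        lost = [s for s, w in self.doing.items() if w == worker]
        for shard in lost:
            del self.doing[shard]
            self.todo.appendleft(shard)
        return lost


def simulate(speeds, num_samples=1000, shard_size=100):
    """Simulate workers (samples per second) pulling shards; return shards per worker."""
    queue = ShardQueue(num_samples, shard_size)
    free_at = {w: 0.0 for w in speeds}  # time at which each worker is next idle
    counts = {w: 0 for w in speeds}
    while True:
        # The worker that becomes idle first pulls the next shard.
        worker = min(free_at, key=lambda w: (free_at[w], w))
        shard = queue.fetch(worker)
        if shard is None:
            break
        start, end = shard
        free_at[worker] += (end - start) / speeds[worker]
        queue.report_done(shard)
        counts[worker] += 1
    return counts


counts = simulate({"fast": 100.0, "slow": 25.0})
```

In this toy simulation the straggler ends up with fewer shards than the fast worker without any explicit rebalancing step, which is the intuition behind allocating data dynamically rather than splitting it evenly up front.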

Augmenting Distributed AI Training with Loss-tolerant Transmission

Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol

ROG: A High Performance and Robust Distributed Training System for Robotic IoT

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

Over-the-air Learning Rate Optimization for Federated Learning

Learning-efficient Transmission Scheduling for Distributed Knowledge-aware Edge Learning

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

TSEngine: Enable Efficient Communication Overlay in Distributed Machine Learning in WANs

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors

UniFL: Enabling Loss-tolerant Transmission in Federated Learning

OSP: Boosting Distributed Model Training with 2-Stage Synchronization

FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Approach for Heterogeneous Edge Devices

FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

Accelerating Geo-distributed Machine Learning with Network-Aware Adaptive Tree and Auxiliary Route

AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes

Communication Efficient Distributed Training with Distributed Lion

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training