Abstract: The parameter server (PS) paradigm has achieved great success in deploying large-scale distributed Deep Learning (DL) systems. However, these systems implicitly assume that the cluster is homogeneous, an assumption that does not hold in many real-world cases. Although previous efforts have addressed heterogeneity, they mainly prioritize the contribution of fast workers and reduce the involvement of slow workers, which leads to workload imbalance and computational inefficiency. We reveal that grouping workers into communities, an abstraction we propose, and handling parameter synchronization at the community level can overcome these limitations and accelerate training convergence. The idea of a community comes from our exploration of prior knowledge about the similarity between workers, which previous work often neglects. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CASP), which uses an Asynchronous Advantage Actor-Critic (A3C)-based algorithm to intelligently determine the community configuration and fully improve synchronization performance. The whole idea has been implemented in a prototype system called Petrel that achieves a good balance between convergence efficiency and communication overhead. Evaluation on various benchmarks, with multiple metrics and baseline comparisons, demonstrates the effectiveness of Petrel. Specifically, Petrel speeds up training convergence by up to 1.87× and reduces communication traffic by up to 26.85 percent, on average, over the non-community synchronization mechanisms.
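To make the community idea concrete, here is a minimal illustrative sketch, not Petrel's actual CASP implementation. It rests on two assumptions of our own: workers are grouped by similarity in per-iteration time using a simple ratio threshold (a hypothetical stand-in for the A3C-based configuration search), and members of a community synchronize with each other synchronously while communities push to the parameter server asynchronously. The names `Worker`, `group_into_communities`, `simulate_pushes`, and `max_ratio` are all hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Worker:
    name: str
    iter_time: float  # seconds per local iteration (hypothetical profiled value)


def group_into_communities(workers: List[Worker], max_ratio: float = 1.5) -> List[List[Worker]]:
    """Group workers of similar speed (hypothetical heuristic, not CASP's A3C policy).

    Workers are sorted by iteration time; a new community starts whenever a worker
    is more than `max_ratio` times slower than the fastest member of the current one.
    """
    ordered = sorted(workers, key=lambda w: w.iter_time)
    communities: List[List[Worker]] = []
    for w in ordered:
        if communities and w.iter_time <= communities[-1][0].iter_time * max_ratio:
            communities[-1].append(w)
        else:
            communities.append([w])
    return communities


def simulate_pushes(communities: List[List[Worker]], horizon: float) -> List[Tuple[float, int]]:
    """Return (time, community_index) for every community-level push up to `horizon`.

    Assumption: inside a community, members sync synchronously, so one round lasts
    as long as its slowest member; across communities, pushes are asynchronous.
    """
    round_times = [max(w.iter_time for w in c) for c in communities]
    heap = [(t, i) for i, t in enumerate(round_times)]  # first completion of each community
    heapq.heapify(heap)
    events: List[Tuple[float, int]] = []
    while heap and heap[0][0] <= horizon:
        t, i = heapq.heappop(heap)
        events.append((t, i))
        heapq.heappush(heap, (t + round_times[i], i))  # schedule the next round
    return events
```

For example, with workers taking 1.0, 1.25, 3.0, and 3.5 seconds per iteration, the sketch forms two communities; within a 7-second window the fast community pushes five times and the slow one twice, whereas a single cluster-wide synchronous barrier would complete only two rounds, each gated by the 3.5-second straggler.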

HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning

Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

OSP: Boosting Distributed Model Training with 2-Stage Synchronization

Accelerating Distributed Machine Learning by Smart Parameter Server

SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training

HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization

Heterogeneity-Aware Distributed Parameter Servers

BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

Distributed Machine Learning through Heterogeneous Edge Systems

A Quadratic Synchronization Rule for Distributed Deep Learning

Priority-based Parameter Propagation for Distributed DNN Training

A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness