Abstract: Recently, due to privacy concerns, distributed machine learning in Wide-Area Networks (DML-WANs) has attracted increasing attention and has been widely deployed to promote intelligence services that rely on geographically distributed data. DML-WANs essentially performs collaborative federated learning over a combination of edge and cloud servers on a large spatial scale. However, efficient model training is challenging for DML-WANs because it is blocked by the high overhead of model parameter synchronization between computing servers over WANs. The reason is that traditional DML-WANs training methods, e.g., FedAvg, have a sequential dependency between local model computing and global model synchronization, which intrinsically produces a sequential blockage between them. When computing heterogeneity and low WAN bandwidth coexist, a long block of global model synchronization prolongs the training time and leads to low utilization of local computing. Despite many efforts to alleviate synchronization overhead with novel communication technologies and synchronization methods, such as FedAsync and ESync, these approaches still use traditional training patterns with sequential dependency and thereby achieve very limited improvements. In this article, we propose NBSync, a novel training algorithm for DML-WANs, which greatly speeds up model training by parallelizing local computing and global synchronization. NBSync employs a well-designed pipelining scheme, which properly relaxes the sequential dependency between local computing and global synchronization and processes them in parallel so that their overheads overlap in time. NBSync also realizes flexible, differentiated, and dynamic local computing for workers to maximize the overlap ratio in dynamically heterogeneous training environments. Convergence analysis shows that the convergence rate of the NBSync training process is asymptotically equal to that of SSGD, and that NBSync has better convergence efficiency. We implemented a prototype of NBSync based on a popular parameter server system, i.e., MXNET's PS-LITE library, and evaluated its performance on a DML-WANs testbed. Experimental results show that NBSync speeds up training by about 1.43×–2.79× compared with state-of-the-art distributed training algorithms (DTAs) in DML-WANs scenarios where computing heterogeneity and low WAN bandwidth coexist.
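
The benefit of overlapping local computing with global synchronization can be illustrated with a toy cost model. The sketch below is not NBSync itself: it assumes fixed per-round compute times, a fixed synchronization time, and an idealized two-stage pipeline, and it ignores staleness, convergence effects, and the differentiated local computing described above. The function names and all numbers are hypothetical.

```python
def sequential_time(compute_times, sync_time, rounds):
    """Blocking pattern: each round waits for the slowest worker,
    then blocks on global synchronization before the next round starts."""
    return rounds * (max(compute_times) + sync_time)


def pipelined_time(compute_times, sync_time, rounds):
    """Idealized two-stage pipeline (an assumption of this sketch):
    synchronization of round r overlaps with local computing of round r+1."""
    slowest = max(compute_times)
    # Only the first compute stage and the last sync stage cannot be hidden;
    # every step in between is bounded by the slower of the two stages.
    return slowest + (rounds - 1) * max(slowest, sync_time) + sync_time


# Hypothetical setting: three heterogeneous workers and a slow WAN link.
compute_times = [2.0, 3.0, 5.0]  # seconds of local computing per round
sync_time = 4.0                  # seconds per global synchronization
rounds = 10

seq = sequential_time(compute_times, sync_time, rounds)
pipe = pipelined_time(compute_times, sync_time, rounds)
print(f"sequential: {seq:.1f}s, pipelined: {pipe:.1f}s")
```

In this model the gain is largest when the compute and synchronization stages take similar time, which is consistent with the abstract's focus on maximizing the overlap ratio.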

NBSync: Parallelism of Local Computing and Global Synchronization for Fast Distributed Machine Learning in WANs

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

FedGSync: Jointly Optimized Weak Synchronization and Gradient Transmission for Fast Distributed Machine Learning in Heterogeneous WAN

Accelerating Model Synchronization for Distributed Machine Learning in an Optical Wide Area Network

FLSGD: Free Local SGD with Parallel Synchronization

An Adaptive Synchronous Parallel Strategy for Distributed Machine Learning

DSANA: A Distributed Machine Learning Acceleration Solution Based on Dynamic Scheduling and Network Acceleration

TSEngine: Enable Efficient Communication Overlay in Distributed Machine Learning in WANs

BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Adaptive Load Balancing for Parameter Servers in Distributed Machine Learning over Heterogeneous Networks

Gsyn: Reducing Staleness and Communication Waiting Via Grouping-based Synchronization for Distributed Deep Learning

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning

Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method

ESync: Accelerating Intra-Domain Federated Learning in Heterogeneous Data Centers

Efficient Communication Scheduling for Parameter Synchronization of DML in Data Center Networks

SSD-SGD: Communication Sparsification for Distributed Deep Learning Training

SSD-SSD: Communication sparsification for distributed deep learning training

OSP: Boosting Distributed Model Training with 2-Stage Synchronization

More Effective Synchronization Scheme in ML Using Stale Parameters