Abstract: Transportation big data (TBD) are increasingly combined with artificial intelligence to mine novel patterns and information, owing to the powerful representational capabilities of deep neural networks (DNNs), especially for anti-COVID-19 applications. The distributed cloud-edge-vehicle training architecture has been applied to accelerate DNN training while ensuring low latency and high privacy for TBD processing. However, the variety of intelligent devices (e.g., intelligent vehicles, edge computing chips at base stations) and networks in intelligent transportation systems leads to heterogeneous computing power and communication capacity among distributed nodes. Existing parallel training mechanisms perform poorly on such heterogeneous cloud-edge-vehicle clusters. The synchronous parallel mechanism forces fast workers to wait for the slowest worker at each synchronization, wasting their computing power. The asynchronous mechanism suffers from communication bottlenecks and can exacerbate the straggler problem, increasing the number of training iterations and even causing incorrect convergence. In this paper, we introduce a distributed training framework, Heter-Train. First, we propose a communication-efficient semi-asynchronous parallel mechanism (SAP-SGD), which takes full advantage of the acceleration that the asynchronous strategy offers for heterogeneous training while constraining the straggler problem through global interval synchronization. Second, considering the differences in node bandwidth, we design a solution for heterogeneous communication. Moreover, we propose a novel weighted aggregation strategy to aggregate model parameters of different versions. Finally, experimental results show that the proposed strategy achieves up to a $6.74\times$ speedup in training time with almost no loss of accuracy.
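To make the idea of aggregating model parameters of different versions concrete, the following is a minimal sketch and not the formula used by Heter-Train, which the abstract does not specify. It assumes a hypothetical weighting in which each worker's contribution is proportional to 1 / (1 + staleness), where staleness is the gap between the global version and the version the worker trained on. The function names `staleness_weights` and `weighted_aggregate` are illustrative only.

```python
from typing import List, Sequence


def staleness_weights(versions: Sequence[int], current_version: int) -> List[float]:
    """Return normalized weights, one per worker.

    Assumption (not from the paper): weight is proportional to
    1 / (1 + staleness), so staler parameters contribute less.
    """
    raw = []
    for v in versions:
        staleness = current_version - v
        if staleness < 0:
            raise ValueError("worker version is newer than the global version")
        raw.append(1.0 / (1.0 + staleness))
    total = sum(raw)
    return [w / total for w in raw]


def weighted_aggregate(
    params: Sequence[Sequence[float]],
    versions: Sequence[int],
    current_version: int,
) -> List[float]:
    """Combine per-worker parameter vectors into one vector.

    Each vector in `params` is scaled by its worker's staleness weight
    and the scaled vectors are summed element-wise.
    """
    if len(params) != len(versions):
        raise ValueError("params and versions must have the same length")
    weights = staleness_weights(versions, current_version)
    dim = len(params[0])
    return [sum(w * p[i] for w, p in zip(weights, params)) for i in range(dim)]
```

With two workers at the same version, the weights are equal and the result is a plain average. A worker one version behind receives half the raw weight of an up-to-date worker before normalization.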

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems
