Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, Xu Chen

2024-08-15

Abstract: On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge. However, the intensive training workload and limited onboard computing resources pose significant challenges to the availability and efficiency of model training. While existing works address these challenges through native resource management optimization, we instead leverage our observation that edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources beyond a single terminal. We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration. Asteroid adopts a hybrid pipeline parallelism to orchestrate distributed training, along with a judicious parallelism planning for maximizing throughput under certain resource constraints. Furthermore, a fault-tolerant yet lightweight pipeline replay mechanism is developed to tame the device-level dynamics for training robustness and performance stability. We implement Asteroid on heterogeneous edge devices with both vision and language models, demonstrating up to 12.2x faster training than conventional parallelism methods and 2.1x faster than state-of-the-art hybrid parallelism methods through evaluations. Furthermore, Asteroid can recover training pipeline 14x faster than baseline methods while preserving comparable throughput despite unexpected device exiting and failure.

Distributed, Parallel, and Cluster Computing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Networking and Internet Architecture

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the issues of resource constraints and efficiency in on-device deep neural network (DNN) training. Specifically:

1. **Efficient DNN Training in Resource-Constrained Environments**:
   - When training DNNs on edge devices, limited computational resources and insufficient communication bandwidth lead to prolonged training times and poor stability.
   - The paper proposes a new system called Asteroid, which accelerates the training process by using a Hybrid Pipeline Parallelism (HPP) mechanism that leverages multiple heterogeneous edge devices.
2. **Effective Collaboration of Heterogeneous Edge Devices**:
   - Edge environments typically consist of various types of devices (such as tablets, laptops, etc.) with different computational capabilities and memory capacities.
   - Asteroid maximizes resource utilization among heterogeneous devices through optimized device grouping strategies and micro-batch allocation.
3. **Fault Tolerance and Dynamic Adaptability**:
   - In practical applications, edge devices may experience failures or dynamic changes (such as device mobility or task switching).
   - The paper designs a lightweight fault tolerance mechanism that enhances system robustness and performance stability through coarse-grained workload migration and topology-driven model replication.

Through these methods, Asteroid achieves efficient distributed training on heterogeneous edge devices, significantly improving training speed and system reliability.
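
The idea of micro-batch allocation across heterogeneous devices can be illustrated with a minimal Python sketch. This is not Asteroid's planning algorithm, which the summary above does not spell out. It is a simple proportional split under stated assumptions: each device in one data-parallel group has a known, fixed throughput in samples per second, and memory limits and communication cost are ignored. The function names `allocate_micro_batch` and `step_time`, and the device names and throughput figures, are hypothetical.

```python
from typing import Dict


def allocate_micro_batch(batch_size: int, throughput: Dict[str, float]) -> Dict[str, int]:
    """Split `batch_size` samples across devices in proportion to throughput.

    Hypothetical helper for illustration only: uses the largest-remainder
    method so the integer shares always sum to `batch_size`.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    if not throughput or any(t <= 0 for t in throughput.values()):
        raise ValueError("every device needs a positive throughput")

    total = sum(throughput.values())
    # Ideal (fractional) share of each device.
    exact = {dev: batch_size * t / total for dev, t in throughput.items()}
    # Round down first, then hand out the leftover samples one by one
    # to the devices with the largest fractional remainders.
    shares = {dev: int(x) for dev, x in exact.items()}
    leftover = batch_size - sum(shares.values())
    by_remainder = sorted(exact, key=lambda dev: exact[dev] - shares[dev], reverse=True)
    for dev in by_remainder[:leftover]:
        shares[dev] += 1
    return shares


def step_time(shares: Dict[str, int], throughput: Dict[str, float]) -> float:
    """Time of one synchronous step: the slowest device sets the pace."""
    return max(shares[dev] / throughput[dev] for dev in shares)


if __name__ == "__main__":
    # Assumed throughputs in samples/second; not measurements from the paper.
    devices = {"laptop": 40.0, "tablet": 20.0, "phone": 10.0}
    shares = allocate_micro_batch(32, devices)
    print(shares)
    print(step_time(shares, devices))
```

With these assumed numbers the faster device receives proportionally more samples, so all three devices finish a step at roughly the same time instead of waiting on the slowest one. A real planner would also have to respect per-device memory capacity and inter-device bandwidth, which this sketch leaves out.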

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Resource-efficient Parallel Split Learning in Heterogeneous Edge Computing

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Model Parallelism Optimization for Distributed DNN Inference on Edge Devices.

Pipeline Parallelism for Inference on Heterogeneous Edge Computing

EdgeSP: Scalable Multi-device Parallel DNN Inference on Heterogeneous Edge Clusters

FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Approach for Heterogeneous Edge Devices

FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

Joint Optimization of Device Placement and Model Partitioning for Cooperative DNN Inference in Heterogeneous Edge Computing

Context-Aware Compilation of DNN Training Pipelines across Edge and Cloud

A Novel Adaptive Computation Offloading Strategy for Collaborative DNN Inference over Edge Devices.

HierTrain: Fast Hierarchical Edge AI Learning with Hybrid Parallelism in Mobile-Edge-Cloud Computing

EdgeMesh: A Hybrid Distributed Training Mechanism for Heterogeneous Edge Devices.

Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs over Heterogeneous Infrastructure

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

Efficient Computer Vision on Edge Devices with Pipeline-Parallel Hierarchical Neural Networks

Accelerating Deep Neural Network Tasks Through Edge-Device Adaptive Inference

PipeEdge: A Trusted Pipelining Collaborative Edge Training Based on Blockchain

Sub-model Parallelism: A Scale-out Deployment Method for Large Multi-modal DNNs