Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, Xu Chen
2024-08-15
Abstract: On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge. However, the intensive training workload and limited onboard computing resources pose significant challenges to the availability and efficiency of model training. While existing works address these challenges through native resource management optimization, we instead leverage our observation that edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources beyond a single terminal. We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration. Asteroid adopts a hybrid pipeline parallelism to orchestrate distributed training, along with a judicious parallelism planning for maximizing throughput under certain resource constraints. Furthermore, a fault-tolerant yet lightweight pipeline replay mechanism is developed to tame the device-level dynamics for training robustness and performance stability. We implement Asteroid on heterogeneous edge devices with both vision and language models, demonstrating up to 12.2x faster training than conventional parallelism methods and 2.1x faster than state-of-the-art hybrid parallelism methods through evaluations. Furthermore, Asteroid can recover training pipeline 14x faster than baseline methods while preserving comparable throughput despite unexpected device exiting and failure.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Computer Vision and Pattern Recognition; Machine Learning; Networking and Internet Architecture
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address the issues of resource constraints and efficiency in on-device deep neural network (DNN) training. Specifically:

1. **Efficient DNN Training in Resource-Constrained Environments**:
   - When training DNNs on edge devices, limited computational resources and insufficient communication bandwidth lead to prolonged training times and poor stability.
   - The paper proposes a new system called Asteroid, which accelerates the training process by using a Hybrid Pipeline Parallelism (HPP) mechanism that leverages multiple heterogeneous edge devices.
2. **Effective Collaboration of Heterogeneous Edge Devices**:
   - Edge environments typically consist of various types of devices (such as tablets, laptops, etc.) with different computational capabilities and memory capacities.
   - Asteroid maximizes resource utilization among heterogeneous devices through optimized device grouping strategies and micro-batch allocation.
3. **Fault Tolerance and Dynamic Adaptability**:
   - In practical applications, edge devices may experience failures or dynamic changes (such as device mobility or task switching).
   - The paper designs a lightweight fault tolerance mechanism that enhances system robustness and performance stability through coarse-grained workload migration and topology-driven model replication.

Through these methods, Asteroid achieves efficient distributed training on heterogeneous edge devices, significantly improving training speed and system reliability.
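To make the idea of micro-batch allocation across heterogeneous devices more concrete, the following is a minimal illustrative sketch, not Asteroid's actual planner. It assumes each device in a group has a profiled throughput and a memory-imposed cap on how many samples it can hold, and splits one micro-batch roughly in proportion to throughput. The `Device` type, its fields, and the `allocate_micro_batch` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    throughput: float  # assumed: samples/second, profiled offline
    max_samples: int   # assumed: largest share that fits in device memory


def allocate_micro_batch(devices, micro_batch_size):
    """Split a micro-batch across devices in proportion to throughput,
    never exceeding any device's memory cap (illustrative heuristic)."""
    if micro_batch_size > sum(d.max_samples for d in devices):
        raise ValueError("micro-batch exceeds the group's combined capacity")

    shares = {d.name: 0 for d in devices}
    remaining = micro_batch_size
    active = list(devices)

    while remaining > 0:
        total = sum(d.throughput for d in active)
        assigned = 0
        for d in active:
            room = d.max_samples - shares[d.name]
            # Proportional share, rounded down and clipped to free memory.
            give = min(room, int(remaining * d.throughput / total))
            shares[d.name] += give
            assigned += give
        remaining -= assigned
        # Devices that hit their cap drop out of later rounds.
        active = [d for d in active if shares[d.name] < d.max_samples]

        if assigned == 0:
            # Only rounding leftovers remain: give them to the fastest
            # devices that still have room, one sample at a time.
            for d in sorted(active, key=lambda x: x.throughput, reverse=True):
                if remaining == 0:
                    break
                shares[d.name] += 1
                remaining -= 1
            active = [d for d in active if shares[d.name] < d.max_samples]

    return shares


group = [
    Device("laptop", throughput=30.0, max_samples=64),
    Device("tablet", throughput=10.0, max_samples=64),
    Device("phone", throughput=10.0, max_samples=2),
]
print(allocate_micro_batch(group, micro_batch_size=20))
```

The faster device receives the larger share, while the memory-limited device is clipped at its cap and its surplus is redistributed to the others. A real planner would also need to account for communication cost and pipeline stage partitioning, which this sketch ignores.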