Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Binbin Huang, Xunqing Huang, Xiao Liu, Chuntao Ding, Yuyu Yin, Shuiguang Deng
DOI: https://doi.org/10.1016/j.comcom.2023.12.034
IF: 5.047
2024-01-01
Computer Communications
Abstract: With the increasing proliferation of Internet-of-Things (IoT) devices, there is a growing trend toward training a deep neural network (DNN) model with pipeline parallelism across resource-constrained IoT devices. To ensure model convergence and accuracy, synchronous pipeline parallelism is usually adopted. However, a synchronous pipeline can incur a long waiting time because it must aggregate the gradients of all microbatches. An adaptive partitioning and efficient scheduling scheme for DNN models in heterogeneous IoT environments is therefore urgently needed. To address this problem, we propose a policy gradient based model partitioning and scheduling scheme (PG-MPSS) to minimize per-iteration training time. More specifically, we first design a double-network framework to partition and schedule a DNN model. Then, we adopt a policy gradient algorithm to update the double-network parameters, aiming to learn an optimal double-network model. We conduct extensive experiments comparing the DNN training time of the PG-MPSS scheme with that of five baseline algorithms, namely Dynamic Programming (DP), Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Average&Greedy (AG), and Proximal Policy Optimization (PPO), under different experimental settings. The experimental results demonstrate that the PG-MPSS scheme can greatly expedite synchronous pipeline training of a DNN model.
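To make the waiting-time issue in the abstract concrete, the following is a minimal sketch of a per-iteration cost model for a synchronous pipeline, followed by a brute-force search over contiguous layer partitions. It is not the PG-MPSS scheme and not the paper's cost model. The function names, the flush-style schedule (all forward passes, then all backward passes), the omission of communication cost, and the rule that a backward pass takes twice as long as a forward pass are all assumptions made for illustration.

```python
from itertools import combinations


def iteration_time(fwd, bwd, num_microbatches):
    """Makespan of one synchronous iteration (assumed schedule:
    every stage finishes all forward passes before its backward passes).

    fwd[s], bwd[s]: per-microbatch forward/backward time of stage s.
    Communication time is ignored in this sketch.
    """
    S, M = len(fwd), num_microbatches

    # Forward: microbatch m on stage s waits for stage s-1 (same microbatch)
    # and for microbatch m-1 on the same stage.
    f_done = [[0.0] * M for _ in range(S)]
    for s in range(S):
        for m in range(M):
            prev_stage = f_done[s - 1][m] if s > 0 else 0.0
            prev_mb = f_done[s][m - 1] if m > 0 else 0.0
            f_done[s][m] = max(prev_stage, prev_mb) + fwd[s]

    # Backward: gradients flow from the last stage to the first. A stage
    # starts its first backward pass only after its last forward pass.
    b_done = [[0.0] * M for _ in range(S)]
    for s in reversed(range(S)):
        for m in range(M):
            next_stage = b_done[s + 1][m] if s < S - 1 else 0.0
            prev_mb = b_done[s][m - 1] if m > 0 else f_done[s][M - 1]
            b_done[s][m] = max(next_stage, prev_mb) + bwd[s]

    # Gradient aggregation can only happen once every stage has finished.
    return max(b_done[s][M - 1] for s in range(S))


def best_partition(layer_cost, device_speed, num_microbatches):
    """Exhaustively try every contiguous split of the layers over the devices.

    layer_cost[i]: abstract compute cost of layer i (hypothetical units).
    device_speed[d]: relative speed of device d (hypothetical units).
    Returns (iteration time, stage boundaries).
    """
    L, S = len(layer_cost), len(device_speed)
    best = (float("inf"), None)
    for cuts in combinations(range(1, L), S - 1):
        bounds = (0, *cuts, L)
        fwd = [
            sum(layer_cost[bounds[i]:bounds[i + 1]]) / device_speed[i]
            for i in range(S)
        ]
        bwd = [2.0 * t for t in fwd]  # assumption: backward costs 2x forward
        t = iteration_time(fwd, bwd, num_microbatches)
        if t < best[0]:
            best = (t, bounds)
    return best
```

Under these assumptions, calling `best_partition([1, 1, 1, 1], [1.0, 3.0], 2)` places one layer on the slow device and the remaining three on the device that is three times faster. Exhaustive search grows combinatorially with the number of layers and devices, which is one reason learned or heuristic schemes such as those compared in the abstract are of interest.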