Distributed Model Training Based on Data Parallelism in Edge Computing-Enabled Elastic Optical Networks

Yajie Li, Zebin Zeng, Jun Li, Boyuan Yan, Yongli Zhao, Jie Zhang
DOI: https://doi.org/10.1109/lcomm.2020.3041453
IF: 3.5529
2021-04-01
IEEE Communications Letters
Abstract: The emergence of edge computing provides an effective solution to execute distributed model training (DMT). The deployment of training data among edge nodes affects the training efficiency and network resource usage. This letter aims for the efficient provisioning of DMT services by optimizing the partition and distribution of training data in edge computing-enabled optical networks. An integer linear programming (ILP) model and a data parallelism deployment algorithm (DPDA) are proposed to solve this problem. The performance of the proposed approaches is evaluated through simulation. Simulation results show that the proposed algorithm can deploy more DMT services than the benchmark.
telecommunications
What problem does this paper attempt to address?
This paper addresses how to efficiently deploy distributed model training (DMT) services in edge computing-enabled elastic optical networks (EC-EONs). Specifically, it focuses on improving the efficiency of DMT services and reducing network resource usage by optimizing the partitioning and distribution of training data.

### Background and Problem Description

With the development of artificial intelligence, the number of AI-based applications and services has surged, and many enterprises rely on AI services from cloud service providers, including data analysis and model training. Model training is time-consuming and resource-intensive: it typically requires large amounts of storage and computing resources to process vast amounts of raw data. To shorten training time and reduce the resource demand on any single node, cloud-edge collaborative DMT has been proposed. It takes two main forms: model parallelism and data parallelism.

In practical systems, the main challenge for data-parallel DMT is how to allocate training data efficiently to multiple edge nodes. Different data partitioning and distribution strategies affect how computing and transmission resources in the network are used. Given a batch of users' training tasks and limited network resources, the cloud service provider aims to find a data partitioning and distribution scheme for each task that allows as many DMT tasks as possible to be executed.

### Solution

The paper proposes two methods to address this problem:

1. **Integer Linear Programming (ILP) Model**: Used to find the optimal solution in small networks.
2. **Data Parallelism Deployment Algorithm (DPDA)**: Used to find a near-optimal solution in large networks.

### Performance Evaluation

The paper evaluates the proposed methods through simulation. The results show that the proposed algorithm can deploy more DMT services than the benchmark algorithm and performs better in terms of resource utilization and iteration time efficiency.

### Main Contributions

- Proposed an ILP model that optimizes the partitioning and distribution of training data to maximize the number of deployed DMT services.
- Designed a heuristic algorithm (DPDA) suitable for large-scale networks that allocates computing and transmission resources.
- Validated the proposed methods through simulation, highlighting advantages in resource utilization and task blocking rate.

### Conclusion

By optimizing the partitioning and distribution of training data, the paper improves the efficiency of DMT services in edge computing-enabled elastic optical networks. According to the simulations, the proposed ILP model and DPDA perform well in resource allocation and task deployment for large-scale DMT workloads.
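
### Illustrative Sketch

The data partitioning problem described above can be made concrete with a small example. The Python sketch below is not the paper's DPDA or its ILP model. It is a simplified greedy heuristic, written under stated assumptions, that splits one task's training data across edge nodes in proportion to their free capacity and blocks the task when capacity is insufficient. The `EdgeNode` class, the `partition_task` function, the `max_workers` parameter, and the capacity unit (data samples) are all hypothetical, and transmission (spectrum) resources are ignored entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EdgeNode:
    name: str
    free_capacity: int  # hypothetical unit: data samples the node can still host


def partition_task(
    data_size: int, nodes: List[EdgeNode], max_workers: int
) -> Optional[Dict[str, int]]:
    """Split data_size samples over at most max_workers nodes.

    Shares are proportional to each node's free capacity. On success the
    capacity is reserved and {node_name: samples} is returned. If the
    selected nodes cannot hold the data, nothing is reserved and None is
    returned (the task is blocked).
    """
    # Pick the nodes with the most free capacity (assumed selection rule).
    ranked = sorted(nodes, key=lambda n: n.free_capacity, reverse=True)
    candidates = [n for n in ranked[:max_workers] if n.free_capacity > 0]
    total_free = sum(n.free_capacity for n in candidates)
    if data_size <= 0 or total_free < data_size:
        return None

    # Proportional share, rounded down; never exceeds a node's free capacity
    # because data_size <= total_free.
    shares = {
        n.name: data_size * n.free_capacity // total_free for n in candidates
    }
    remaining = data_size - sum(shares.values())

    # Hand out the rounding remainder to nodes that still have room.
    for n in candidates:
        if remaining == 0:
            break
        extra = min(remaining, n.free_capacity - shares[n.name])
        shares[n.name] += extra
        remaining -= extra

    # Reserve the capacity on each node.
    for n in candidates:
        n.free_capacity -= shares[n.name]
    return {name: s for name, s in shares.items() if s > 0}


nodes = [EdgeNode("A", 60), EdgeNode("B", 30), EdgeNode("C", 10)]
print(partition_task(50, nodes, max_workers=2))  # {'A': 34, 'B': 16}
print(partition_task(60, nodes, max_workers=3))  # None (only 50 samples of capacity left)
```

The second call shows how earlier partitioning decisions determine whether later tasks are blocked. This is the trade-off the paper's optimization targets, although the paper also accounts for transmission resources in the optical network.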