Abstract: The emergence of edge computing provides an effective way to execute distributed model training (DMT). How training data is deployed among edge nodes affects both training efficiency and network resource usage. This letter aims to provision DMT services efficiently by optimizing the partition and distribution of training data in edge computing-enabled optical networks. An integer linear programming (ILP) model and a data parallelism deployment algorithm (DPDA) are proposed to solve this problem. The performance of the proposed approaches is evaluated through simulation. Simulation results show that the proposed algorithm can deploy more DMT services than the benchmark.

What problem does this paper attempt to address?

This paper attempts to address the problem of how to efficiently deploy Distributed Model Training (DMT) services in Edge Computing-enabled Elastic Optical Networks (EC-EONs). Specifically, the paper focuses on how to improve the efficiency of DMT services and reduce the use of network resources by optimizing the partitioning and distribution of training data.

### Background and Problem Description

With the development of artificial intelligence, the number of AI-based applications and services has surged, and many enterprises require AI services provided by cloud service providers, including data analysis and model training. Model training is a time-consuming and resource-intensive process that typically requires a large amount of storage and computing resources to process vast amounts of raw data. To shorten training time and alleviate the resource demand on a single node, cloud-edge collaborative Distributed Model Training (DMT) has been proposed, which mainly includes model parallelism and data parallelism.

In practical systems, the main challenge faced by data-parallel DMT is how to efficiently allocate training data to multiple edge nodes. Different data partitioning and distribution strategies affect the use of computing and transmission resources in the network. Under limited network resources, given a batch of users' training tasks, cloud service providers aim to find the best data partitioning and distribution scheme for each task so that as many DMT tasks as possible can be executed.

### Solution

The paper proposes two methods to address this problem:

1. **Integer Linear Programming (ILP) Model**: Used to find the optimal solution in small networks.
2. **Data Parallelism Deployment Algorithm (DPDA)**: Used to find a near-optimal solution in large networks.

### Performance Evaluation

Through simulations, the paper evaluates the performance of the proposed methods. The simulation results show that the proposed algorithms can deploy more DMT services than benchmark algorithms and perform better in terms of resource utilization and iteration time efficiency.

### Main Contributions

- Proposed an ILP model to optimize the partitioning and distribution of training data to maximize the deployment of DMT services.
- Designed a heuristic algorithm (DPDA) suitable for large-scale networks, capable of effectively allocating computing and transmission resources.
- Validated the effectiveness of the proposed methods through simulations, particularly highlighting advantages in resource utilization and task blocking rate.

### Conclusion

By optimizing the partitioning and distribution of training data, the paper improves the efficiency of distributed model training services in edge computing-enabled elastic optical networks. The proposed ILP model and DPDA perform well in resource allocation and task deployment, effectively meeting the demands of large-scale DMT tasks.
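To make the data-partitioning problem described above more concrete, the following is a minimal Python sketch of one simple way to split training data across edge nodes and admit a batch of tasks. It is not the paper's ILP model or DPDA, whose details are not given here. The function names, the proportional-split rule, the smallest-task-first ordering, and the idea of measuring node capacity in "samples per iteration" are all assumptions made for illustration, and the sketch ignores transmission resources entirely.

```python
from typing import Dict


def partition_proportionally(total_samples: int, capacity: Dict[str, int]) -> Dict[str, int]:
    """Split total_samples across nodes in proportion to free capacity.

    Hypothetical helper: capacity is the number of samples each node can
    still process per iteration (an assumed unit, not from the paper).
    """
    if total_samples == 0:
        return {node: 0 for node in capacity}
    total_capacity = sum(capacity.values())
    if total_samples > total_capacity:
        raise ValueError("not enough free capacity for this task")
    # Integer floor of each node's proportional share.
    shares = {n: total_samples * c // total_capacity for n, c in capacity.items()}
    # Hand out the samples lost to rounding, largest remainder first.
    leftover = total_samples - sum(shares.values())
    by_remainder = sorted(
        capacity,
        key=lambda n: (total_samples * capacity[n]) % total_capacity,
        reverse=True,
    )
    for node in by_remainder[:leftover]:
        shares[node] += 1
    return shares


def deploy_tasks(tasks: Dict[str, int], capacity: Dict[str, int]) -> Dict[str, Dict[str, int]]:
    """Greedily admit tasks (smallest first) while capacity remains.

    Smallest-first is an assumed heuristic for admitting many tasks;
    tasks that do not fit are skipped, i.e. blocked.
    """
    free = dict(capacity)
    placement: Dict[str, Dict[str, int]] = {}
    for name, size in sorted(tasks.items(), key=lambda kv: kv[1]):
        if size > sum(free.values()):
            continue  # blocked: not enough computing capacity left
        shares = partition_proportionally(size, free)
        for node, n in shares.items():
            free[node] -= n
        placement[name] = shares
    return placement


if __name__ == "__main__":
    nodes = {"a": 60, "b": 30, "c": 10}
    batch = {"t1": 50, "t2": 40, "t3": 30}
    for task, shares in deploy_tasks(batch, nodes).items():
        print(task, shares)
```

In this toy run, the two smaller tasks are admitted and the largest is blocked once the remaining capacity drops below its size. A fuller treatment would also have to account for the transmission resources needed to move each data partition to its node, which is the part the paper's methods address jointly with computing resources.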

Distributed Model Training Based on Data Parallelism in Edge Computing-Enabled Elastic Optical Networks

Efficient Partitioning and Communication Scheme-Based Distributed Edge Computing to Accelerate Deep Neural Network

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

D2D-Enabled Data Sharing for Distributed Machine Learning at Wireless Network Edge

Decentralized Proactive Model Offloading and Resource Allocation for Split and Federated Learning

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

Implementation of Big AI Models for Wireless Networks with Collaborative Edge Computing

Resource-efficient Parallel Split Learning in Heterogeneous Edge Computing

FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Approach for Heterogeneous Edge Devices

FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform

Distributed Task Offloading in Cooperative Mobile Edge Computing Networks

Deep Reinforcement Learning Method for Task Offloading in Mobile Edge Computing Networks Based on Parallel Exploration with Asynchronous Training

An Online Approach for DNN Model Caching and Processor Allocation in Edge Computing

Edge–IoT Computing and Networking Resource Allocation for Decomposable Deep Learning Inference

Computational Offloading in Semantic-Aware Cloud-Edge-End Collaborative Networks

Distributed Deep Learning Model for Intelligent Video Surveillance Systems with Edge Computing

Collaborate Edge and Cloud Computing With Distributed Deep Learning for Smart City Internet of Things

Resource Allocation for Stable LLM Training in Mobile Edge Computing