Abstract: In this work, a <italic>Hybrid Hierarchical Federated Edge Learning</italic> (HHFEL) architecture, consisting of a device layer, an edge layer, and a cloud layer over heterogeneous networks, is investigated for large-scale model training. In such systems, learning efficiency is severely degraded by limited communication resources and by device heterogeneity in local data distribution and computation capability. The degradation is especially pronounced for synchronous federated learning (FL) mechanisms, where each training round must wait for the slowest device. Asynchronous FL has been proposed to tackle this issue: it allows devices with powerful computation and communication capabilities to exchange information with the server more frequently. However, asynchronous FL faces a new challenge of low accuracy caused by imbalanced local model updates. To overcome the shortcomings of both synchronous and asynchronous FL, we propose an enhanced online semi-asynchronous FL mechanism between the edge and device layers, where each device trains its local model with newly generated data and, in each round, each edge server aggregates a number of local models according to their arrival order. In particular, devices with faster training speeds fully utilize their idle time by training their local models repeatedly. Meanwhile, synchronous FL with an edge elastic update strategy is adopted between the cloud and edge layers for personalized information exchange. Considering the continuous generation of data, we formulate the problem as an online <italic>Markov Decision Process</italic> (MDP) to realize communication- and computation-efficient HHFEL via joint device selection and resource allocation. Because the problem is non-convex and combinatorial, we develop a hybrid <italic>Deep Q-Network</italic> (DQN) and <italic>Deep Deterministic Policy Gradient</italic> (DDPG) approach with low computational complexity to adapt the device selection and resource allocation strategies. Numerical results show the effectiveness of the proposed mechanism compared with existing benchmarks.
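
The semi-asynchronous edge-device step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract states only that each edge server aggregates a number of local models in arrival order and that faster devices keep training during idle time. The sample-count-weighted average (FedAvg-style), the `LocalUpdate` record, the parameter `k`, and the helper `extra_local_passes` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LocalUpdate:
    """Hypothetical record of one device's upload in a round."""
    device_id: int
    arrival_time: float   # seconds after the round starts
    num_samples: int      # size of the newly generated local data
    weights: List[float]  # flattened local model parameters


def semi_async_aggregate(
    updates: List[LocalUpdate], k: int
) -> Tuple[List[float], List[int], float]:
    """Aggregate the first k updates to arrive at the edge server.

    Assumption: a sample-count-weighted average is used as the
    aggregation rule; the abstract does not specify one.
    Returns (aggregated weights, selected device ids, round end time).
    """
    if not 1 <= k <= len(updates):
        raise ValueError("k must be between 1 and the number of updates")
    # Arrival order decides which local models enter this round.
    first_k = sorted(updates, key=lambda u: u.arrival_time)[:k]
    total = sum(u.num_samples for u in first_k)
    dim = len(first_k[0].weights)
    aggregated = [
        sum(u.weights[i] * u.num_samples for u in first_k) / total
        for i in range(dim)
    ]
    # The round closes when the k-th update arrives.
    round_end = first_k[-1].arrival_time
    return aggregated, [u.device_id for u in first_k], round_end


def extra_local_passes(train_time: float, round_end: float) -> int:
    """Number of full local training passes that fit before the round closes.

    Models the idea that faster devices train repeatedly instead of idling.
    """
    return max(1, int(round_end // train_time))


if __name__ == "__main__":
    updates = [
        LocalUpdate(0, 1.0, 10, [1.0, 2.0]),
        LocalUpdate(1, 3.0, 30, [3.0, 4.0]),
        LocalUpdate(2, 9.0, 5, [100.0, 100.0]),  # straggler, skipped this round
    ]
    model, chosen, round_end = semi_async_aggregate(updates, k=2)
    print(model, chosen, round_end)
    print(extra_local_passes(train_time=1.0, round_end=round_end))
```

In this sketch the straggler's update is simply left out of the current round. How late updates are handled, and how `k` and the device set are chosen, is what the paper's joint device selection and resource allocation policy would decide.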

Elastic Scheduler: Heterogeneous and Dynamic Deep Learning in the Cloud

Energy Optimization for Federated Learning on Consumer Mobile Devices with Asynchronous SGD and Application Co-Execution

Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers

Elastic Deep Learning in Multi-Tenant GPU Clusters

ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG

EasyScale: Accuracy-consistent Elastic Training for Deep Learning

Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources

Differentiate Quality of Experience Scheduling for Deep Learning Inferences with Docker Containers in the Cloud

ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning

Elan: Towards Generic and Efficient Elastic Training for Deep Learning

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Effective Elastic Scaling of Deep Learning Workloads

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Enhanced Hybrid Hierarchical Federated Edge Learning Over Heterogeneous Networks

ELASTIC: Edge Workload Forecasting Based on Collaborative Cloud-Edge Deep Learning

An Optimal Resource Allocator of Elastic Training for Deep Learning Jobs on Cloud

ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs

VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling

HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments