Abstract: In this work, a <italic>Hybrid Hierarchical Federated Edge Learning</italic> (HHFEL) architecture, which consists of a device layer, an edge layer, and a cloud layer over heterogeneous networks, is investigated for large-scale model training. In such systems, learning efficiency is severely degraded by limited communication resources and by device heterogeneity in local data distribution and computation capability, especially for synchronous federated learning (FL) mechanisms, where each training round must wait for the slowest device. To tackle this issue, asynchronous FL has been proposed, which allows devices with powerful computation and communication capabilities to exchange information with the server more frequently. However, asynchronous FL faces a new challenge: low accuracy caused by imbalanced local model updates. To overcome the shortcomings of both synchronous and asynchronous FL, we propose an enhanced online semi-asynchronous FL mechanism between the edge and device layers, where each device trains its local model with newly generated data and, in each round, each edge server aggregates a number of local models based on their arrival order. In particular, devices with faster training speeds fully utilize their idle time by training their local models repeatedly. Meanwhile, synchronous FL with an edge elastic update strategy is adopted between the cloud and edge layers for personalized information exchange. Considering the continuous data generation, we formulate the problem as an online <italic>Markov Decision Process</italic> (MDP) to realize communication- and computation-efficient HHFEL via joint device selection and resource allocation. Because the problem is non-convex and combinatorial, we develop a hybrid <italic>Deep Q-Network</italic> (DQN) and <italic>Deep Deterministic Policy Gradient</italic> (DDPG) approach with low computational complexity to adapt the device selection and resource allocation strategies. Numerical results show the effectiveness of the proposed mechanism compared with existing benchmarks.
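
The semi-asynchronous round described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `semi_async_round`, the per-device `train_time` values, the parameter `k`, and the use of an unweighted average are all assumptions made for illustration, since the abstract does not specify the aggregation weights or the timing model.

```python
from typing import Dict, List, Tuple


def semi_async_round(
    local_models: Dict[str, List[float]],
    train_time: Dict[str, float],
    k: int,
) -> Tuple[List[float], List[str], Dict[str, int]]:
    """Simulate one edge-server round (hypothetical helper, not from the paper).

    local_models: device id -> flat parameter vector uploaded by that device.
    train_time:   device id -> time for one local training pass (assumed known).
    k:            number of local models the edge server waits for.
    """
    if not 1 <= k <= len(local_models):
        raise ValueError("k must be between 1 and the number of devices")

    # Devices that finish a pass sooner arrive at the edge server first.
    arrival_order = sorted(local_models, key=lambda d: train_time[d])
    selected = arrival_order[:k]

    # The round closes when the k-th model arrives.
    round_length = train_time[selected[-1]]

    # Faster devices use their idle time for repeated local passes.
    local_passes = {d: int(round_length // train_time[d]) for d in selected}

    # Assumption: plain unweighted average of the selected models.
    dim = len(local_models[selected[0]])
    aggregate = [
        sum(local_models[d][j] for d in selected) / k for j in range(dim)
    ]
    return aggregate, selected, local_passes
```

With four devices and `k = 3`, the slowest device is left out of the round, and the fastest device fits several local passes into the time the edge server spends waiting for the third arrival.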

ALEPH: Accelerating Distributed Training with eBPF-Based Hierarchical Gradient Aggregation

HCEC: An efficient geo-distributed deep learning training strategy based on wait-free back-propagation

Enhanced Hybrid Hierarchical Federated Edge Learning Over Heterogeneous Networks

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Hierarchical federated learning based on wireless D2D networks

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

Identifying Performance Bottleneck in Shared In-Network Aggregation During Distributed Training

AggTree: A Routing Tree with In-Network Aggregation for Distributed Training

No One Idles: Efficient Heterogeneous Federated Learning with Parallel Edge and Server Computation

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Optimizing Training Efficiency and Cost of Hierarchical Federated Learning in Heterogeneous Mobile-Edge Cloud Computing

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

Decentralized Training of Foundation Models in Heterogeneous Environments

AEDFL: Efficient Asynchronous Decentralized Federated Learning with Heterogeneous Devices

Augmenting Distributed AI Training with Loss-tolerant Transmission

Heterogeneity-Aware Resource Allocation and Topology Design for Hierarchical Federated Edge Learning

End-to-end Adaptive Distributed Training on PaddlePaddle

Peering Beyond the Gradient Veil with Distributed Auto Differentiation