Heterogeneity-Aware Resource Allocation and Topology Design for Hierarchical Federated Edge Learning

Zhidong Gao, Yu Zhang, Yanmin Gong, Yuanxiong Guo
2024-09-29
Abstract: Federated Learning (FL) provides a privacy-preserving framework for training machine learning models on mobile edge devices. Traditional FL algorithms, e.g., FedAvg, impose a heavy communication workload on these devices. To mitigate this issue, Hierarchical Federated Edge Learning (HFEL) has been proposed, leveraging edge servers as intermediaries for model aggregation. Despite its effectiveness, HFEL encounters challenges such as a slow convergence rate and high resource consumption, particularly in the presence of system and data heterogeneity. However, existing works are mainly focused on improving training efficiency for traditional FL, leaving the efficiency of HFEL largely unexplored. In this paper, we consider a two-tier HFEL system, where edge devices are connected to edge servers and edge servers are interconnected through peer-to-peer (P2P) edge backhauls. Our goal is to enhance the training efficiency of the HFEL system through strategic resource allocation and topology design. Specifically, we formulate an optimization problem to minimize the total training latency by allocating the computation and communication resources, as well as adjusting the P2P connections. To ensure convergence under dynamic topologies, we analyze the convergence error bound and introduce a model consensus constraint into the optimization problem. The proposed problem is then decomposed into several subproblems, enabling us to alternately solve it online. Our method facilitates the efficient implementation of large-scale FL at edge networks under data and system heterogeneity. Comprehensive experimental evaluation on benchmark datasets validates the effectiveness of the proposed method, demonstrating significant reductions in training latency while maintaining the model accuracy compared to various baselines.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper aims to improve the training efficiency of the Hierarchical Federated Edge Learning (HFEL) system under system and data heterogeneity. Specifically, it focuses on reducing training latency through resource allocation and topology design while maintaining model accuracy. The main challenges faced by the HFEL system include:

1. **System heterogeneity**: The resources of edge devices (such as battery life, communication bandwidth, and CPU frequency) are limited and unevenly distributed. In traditional HFEL, faster devices have to wait for slower devices to complete training, which wastes resources.
2. **Data heterogeneity**: The data collected by different devices depends on geographical location or operating environment, so the data are non-independent and identically distributed (non-IID), which degrades the convergence speed and accuracy of the model.

To address these challenges, the paper proposes an optimization method that improves the training efficiency of the HFEL system in the following ways:

- **Construction of the optimization problem**: The paper formulates an optimization problem that minimizes the total training latency while ensuring model convergence. The problem covers the allocation of computation and communication resources and the adjustment of peer-to-peer (P2P) connections.
- **Model consensus constraint**: To ensure convergence under a dynamic topology, the paper analyzes the convergence error bound and introduces a model consensus constraint.
- **Design of the online algorithm**: The proposed optimization problem is decomposed into multiple subproblems that are solved iteratively by an online algorithm (FedRT), which dynamically adjusts control decisions according to the system state, environmental conditions, and available resources evaluated in real time.
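To make the system-heterogeneity challenge concrete, the following is a minimal sketch of how a synchronous training round is gated by its slowest device. It is not the paper's latency model: the `Device` fields, the cycles-per-sample compute model, and the fixed-rate upload model are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # All fields are hypothetical parameters for this illustration.
    cycles_per_sample: float  # CPU cycles needed to process one sample
    num_samples: int          # size of the local dataset
    cpu_hz: float             # allocated CPU frequency (cycles per second)
    model_bits: float         # size of the model update to upload
    uplink_bps: float         # allocated uplink rate (bits per second)


def device_latency(d: Device, local_epochs: int = 1) -> float:
    """Computation time plus upload time for one device in one round."""
    compute = local_epochs * d.cycles_per_sample * d.num_samples / d.cpu_hz
    upload = d.model_bits / d.uplink_bps
    return compute + upload


def round_latency(devices, local_epochs: int = 1) -> float:
    """A synchronous round ends only when the slowest device has finished."""
    return max(device_latency(d, local_epochs) for d in devices)


def idle_times(devices, local_epochs: int = 1):
    """Time each device spends waiting for the slowest one."""
    t_round = round_latency(devices, local_epochs)
    return [t_round - device_latency(d, local_epochs) for d in devices]


fast = Device(cycles_per_sample=1e6, num_samples=100, cpu_hz=1e9,
              model_bits=8e6, uplink_bps=8e6)
slow = Device(cycles_per_sample=1e6, num_samples=100, cpu_hz=5e8,
              model_bits=8e6, uplink_bps=4e6)
print(round_latency([fast, slow]))
print(idle_times([fast, slow]))
```

Under these assumed numbers the fast device sits idle for half of the round. Reallocating CPU frequency and bandwidth so that per-device latencies are closer together is the kind of decision a resource-allocation subproblem would make.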
The paper verifies the effectiveness of the proposed method through experiments on benchmark datasets, and the results show that this method is superior to multiple baseline methods in reducing training latency and improving convergence speed.
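The role of the P2P topology in model consensus, discussed in the summary above, can be illustrated with a small gossip-averaging sketch. This is an assumption-laden toy rather than the paper's consensus constraint: each edge server holds a single scalar in place of a model, and the mixing weights follow the standard Metropolis rule, which the source does not specify.

```python
def metropolis_weights(adj):
    """Build a symmetric, doubly stochastic mixing matrix from an adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                w[i][j] = 1.0 / (1 + max(deg[i], deg[j]))
        # Self-weight absorbs the remainder so that each row sums to 1.
        w[i][i] = 1.0 - sum(w[i])
    return w


def gossip_step(w, values):
    """One round of P2P averaging: each server mixes its neighbors' values."""
    n = len(values)
    return [sum(w[i][j] * values[j] for j in range(n)) for i in range(n)]


def consensus_error(values):
    """Largest deviation of any server from the network-wide average."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)


def error_after(adj, values, rounds):
    """Consensus error after a given number of gossip rounds on a fixed topology."""
    w = metropolis_weights(adj)
    for _ in range(rounds):
        values = gossip_step(w, values)
    return consensus_error(values)


# Four edge servers: a sparse ring versus a fully connected backhaul.
ring = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
full = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
start = [1.0, 0.0, 0.0, 0.0]
print(error_after(ring, start, 1))
print(error_after(full, start, 1))
```

In this toy setting the fully connected topology reaches the average in one round, while the ring still has a residual error, at the cost of more backhaul links per round. A topology-design step trades off these two effects: fewer P2P connections lower communication latency, but the servers' models must still stay close enough to each other for training to converge.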