FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts

Hanzi Mei, Dongqi Cai, Ao Zhou, Shangguang Wang, Mengwei Xu
2024-08-21
Abstract: As Large Language Models (LLMs) push the boundaries of AI capabilities, their demand for data is growing. Much of this data is private and distributed across edge devices, making Federated Learning (FL) a de-facto alternative for fine-tuning (i.e., FedLLM). However, it faces significant challenges due to the inherent heterogeneity among clients, including varying data distributions and diverse task types. Towards a versatile FedLLM, we replace traditional dense model with a sparsely-activated Mixture-of-Experts (MoE) architecture, whose parallel feed-forward networks enable greater flexibility. To make it more practical in resource-constrained environments, we present FedMoE, the efficient personalized FL framework to address data heterogeneity, constructing an optimal sub-MoE for each client and bringing the knowledge back to global MoE. FedMoE is composed of two fine-tuning stages. In the first stage, FedMoE simplifies the problem by conducting a heuristic search based on observed activation patterns, which identifies a suboptimal submodel for each client. In the second stage, these submodels are distributed to clients for further training and returned for server aggregating through a novel modular aggregation strategy. Meanwhile, FedMoE progressively adjusts the submodels to optimal through global expert recommendation. Experimental results demonstrate the superiority of our method over previous personalized FL methods.
Machine Learning
What problem does this paper attempt to address?
This paper addresses client-side data heterogeneity in Federated Learning (FL), especially when fine-tuning Large Language Models (LLMs). Specifically, it focuses on improving model performance through personalized Federated Learning (personalized FL) while reducing memory usage and network traffic on resource-constrained edge devices.

### Background and Motivation

1. **Federated Learning (FL)**:
   - FL allows multiple edge devices to collaboratively train a shared model through cloud-edge communication without sharing local data, thus protecting privacy.
   - However, because the update directions of different tasks can diverge or conflict, it is difficult for the global model to converge to an optimum.
   - The resource limitations of edge devices and network transmission bottlenecks also restrict the scale and overall performance of the model.
2. **Mixture of Experts (MoE)**:
   - The MoE architecture scales model capacity efficiently through sparse activation and a parallel expert structure, significantly improving performance on downstream tasks.
   - Each expert can handle specific tasks or data subsets, so MoE models perform well in multi-task learning and heterogeneous data environments.

### Main Contributions of the Paper

1. **Preliminary Experiments**:
   - Preliminary experiments based on Switch Transformers studied the characteristics of expert activation. They found that expert activation frequency changes dynamically during fine-tuning and that different data distributions prefer different subsets of experts.
2. **FedMoE System**:
   - The paper proposes FedMoE, an efficient Federated Learning system that integrates a Transformer-based MoE model to deal with data heterogeneity.
   - FedMoE dynamically searches for and assigns personalized experts to different clients and absorbs the resulting knowledge back into the general global model.
3. **Experimental Verification**:
   - Extensive experiments verify the effectiveness of FedMoE. The results show that FedMoE outperforms existing baseline methods on all tasks while reducing memory usage and network traffic.

### Method Overview

1. **Problem Definition**:
   - In personalized Federated Learning, multiple clients collaborate to learn multiple downstream tasks, and each client trains at the edge on its local dataset.
   - Each client aims to minimize the sum of the label-smoothed cross-entropy loss and the weighted load-balancing loss, subject to its memory limit.
2. **Model Structure**:
   - A large MoE model is hosted in the cloud, and each client hosts a heterogeneous sub-MoE model, which is sampled from the global model and retains the experts most relevant to the client's data characteristics.
3. **Workflow**:
   - **First Stage**: Coarse-grained Sub-model Initialization
     - Activation information is collected from clients, and a heuristic search based on expert activation probability determines the initial sub-model architecture of each client.
   - **Second Stage**: Federated Training and Fine-grained Sub-model Adjustment
     - Federated training starts from the sub-models initialized in the first stage. The knowledge in each client's sub-model is integrated back into the global model through the modular aggregation strategy, and fine-grained structural adjustments are made according to real-time feedback.
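
The two stages above can be illustrated with a minimal Python sketch. This summary does not give the paper's exact search heuristic or aggregation weights, so the sketch makes two labeled assumptions: stage one keeps the top-k experts by observed activation frequency, and stage two averages each expert over only the clients that trained it, weighted by local sample count. The names `select_experts`, `aggregate_experts`, and `budget` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def select_experts(activation_freq, budget):
    """Stage 1 (assumed heuristic): keep the `budget` most activated experts.

    activation_freq: dict mapping expert id -> observed activation frequency.
    budget: max number of experts the client can hold (a hypothetical
            stand-in for the client's memory limit).
    """
    # Sort by descending frequency; break ties by expert id for determinism.
    ranked = sorted(activation_freq, key=lambda e: (-activation_freq[e], e))
    return sorted(ranked[:budget])


def aggregate_experts(global_experts, client_updates):
    """Stage 2 (assumed rule): per-expert, sample-weighted averaging.

    global_experts: dict expert id -> list of floats (flattened weights).
    client_updates: list of (num_samples, {expert id -> list of floats}).
    Experts that no client trained keep their previous global weights.
    """
    sums = {}
    totals = defaultdict(float)
    for num_samples, experts in client_updates:
        for eid, weights in experts.items():
            if eid not in sums:
                sums[eid] = [0.0] * len(weights)
            for i, w in enumerate(weights):
                sums[eid][i] += num_samples * w
            totals[eid] += num_samples

    new_global = {eid: list(w) for eid, w in global_experts.items()}
    for eid, acc in sums.items():
        new_global[eid] = [v / totals[eid] for v in acc]
    return new_global


# Toy example: two clients with different activation patterns.
freq_a = {0: 0.6, 1: 0.3, 2: 0.1, 3: 0.0}
freq_b = {0: 0.05, 1: 0.15, 2: 0.5, 3: 0.3}
sel_a = select_experts(freq_a, budget=2)
sel_b = select_experts(freq_b, budget=2)

# Toy aggregation: the clients overlap only on expert 1.
global_experts = {0: [0.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 0.0], 3: [5.0, 5.0]}
updates = [
    (10, {0: [1.0, 1.0], 1: [2.0, 2.0]}),
    (30, {1: [4.0, 4.0], 2: [3.0, 3.0]}),
]
merged = aggregate_experts(global_experts, updates)
```

In the toy run, the two clients end up with disjoint expert subsets, and only expert 1 is averaged across both clients. Expert 3, which no client trained, keeps its global weights. Averaging per expert rather than per model is what lets sub-models with different architectures contribute to one global MoE.
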

### Experimental Results

- **Performance Comparison under Different Settings**:
  - Under the Standard-Hetero-T (Standard Heterogeneous Task), Standard-Hetero-TD (Standard Heterogeneous Task and Data Distribution), Enforced-Hetero-T (Enforced Heterogeneous Task), and Enforced-Hetero-TD (Enforced Heterogeneous Task and Data Distribution) settings, FedMoE shows superior performance, especially in terms of memory usage and communication volume.

### Conclusion

The paper addresses the data heterogeneity problem in Federated Learning through the FedMoE system, which assigns each client a personalized sub-MoE and aggregates the resulting knowledge back into the global MoE.