Recovering Global Data Distribution Locally in Federated Learning

Ziyu Yao
2024-09-21
Abstract: Federated Learning (FL) is a distributed machine learning paradigm that enables collaboration among multiple clients to train a shared model without sharing raw data. However, a major challenge in FL is the label imbalance, where clients may exclusively possess certain classes while having numerous minority and missing classes. Previous works focus on optimizing local updates or global aggregation but ignore the underlying imbalanced label distribution across clients. In this paper, we propose a novel approach ReGL to address this challenge, whose key idea is to Recover the Global data distribution Locally. Specifically, each client uses generative models to synthesize images that complement the minority and missing classes, thereby alleviating label imbalance. Moreover, we adaptively fine-tune the image generation process using local real data, which makes the synthetic images align more closely with the global distribution. Importantly, both the generation and fine-tuning processes are conducted at the client-side without leaking data privacy. Through comprehensive experiments on various image classification datasets, we demonstrate the remarkable superiority of our approach over existing state-of-the-art works in fundamentally tackling label imbalance in FL.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the performance degradation in Federated Learning (FL) caused by label distribution skew. Each client may hold a large amount of data for certain classes while holding very little, or none at all, for others. This uneven label distribution weakens the global generalization ability of the model, especially on minority and missing classes.

### Paper Background

Federated Learning is a distributed machine learning paradigm that allows multiple clients to collaborate in training a shared model without sharing the original data. One of its main challenges is label distribution skew. Traditional centralized training can solve this problem by balancing the data set, but in Federated Learning the non-IID (non-independent and identically distributed) nature of the data makes the problem harder. In real-world applications, the data distribution varies greatly among clients, so some labels are over-represented on some clients and under-represented or completely missing on others.

### Paper Objectives

The paper proposes a new method, ReGL (Recovering Global data distribution Locally), which aims to recover the global data distribution by generating synthetic data locally on the clients, thereby alleviating label distribution skew. Each client uses a generative model to produce data that supplements its minority and missing classes, and then trains the local model on the synthetic data together with the real data. In this way, ReGL aims to recover the global data distribution and improve both the global generalization ability and the local personalization performance of the model without compromising data privacy.

### Main Contributions

1. **Generating Synthetic Data**: ReGL uses foundation generative models (such as Stable Diffusion) to generate synthetic images that supplement the minority and missing classes, thereby alleviating label distribution skew.
2. **Adaptive Fine-Tuning**: To bring the synthetic data closer to the global data distribution, each client adaptively fine-tunes the generative model on its local real data, so that generation better fits the distribution of the specific data set.
3. **Experimental Verification**: The paper evaluates ReGL through comprehensive experiments on multiple image classification data sets, and reports that it significantly outperforms existing state-of-the-art methods in both global generalization and local personalization tasks.

### Experimental Results

- **Global Generalization Performance**: ReGL performs well under label distribution skew. In extreme distribution cases in particular, it far exceeds traditional methods such as FedAvg.
- **Local Personalization Performance**: ReGL also performs well in local personalization tasks. Combining synthetic data with real data significantly improves the performance of the model on each client.

### Conclusion

ReGL addresses label distribution skew in Federated Learning by generating synthetic data locally on the clients, and significantly improves the global generalization ability and local personalization performance of the model. The experimental results on multiple data sets support its effectiveness.
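
The client-side idea of supplementing minority and missing classes can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `plan_synthetic_counts` and `build_training_set`, the placeholder `generate_fn`, and the rule of topping every class up to the size of the largest local class are all assumptions made for illustration. The text above does not state how many synthetic images ReGL generates per class, and a real client would call a generative model (such as Stable Diffusion) where the placeholder is used.

```python
from collections import Counter


def plan_synthetic_counts(local_labels, num_classes, target_per_class=None):
    """Return {class_id: number of synthetic samples this client should add}.

    Assumption (not from the paper): every class is topped up to
    `target_per_class`, which defaults to the largest local class size.
    """
    counts = Counter(local_labels)
    if target_per_class is None:
        target_per_class = max(counts.values(), default=0)
    plan = {}
    for class_id in range(num_classes):
        deficit = target_per_class - counts.get(class_id, 0)
        if deficit > 0:  # minority class (some data) or missing class (no data)
            plan[class_id] = deficit
    return plan


def build_training_set(real_samples, plan, generate_fn):
    """Mix real (sample, label) pairs with synthetic ones produced locally.

    `generate_fn(class_id, index)` is a hypothetical stand-in for a
    class-conditional generative model running on the client.
    """
    synthetic = [
        (generate_fn(class_id, i), class_id)
        for class_id, n in plan.items()
        for i in range(n)
    ]
    return list(real_samples) + synthetic


# Example: a client with 50 samples of class 0, 5 of class 1, none of classes 2 and 3.
local_labels = [0] * 50 + [1] * 5
real_samples = [(f"real_{i}", y) for i, y in enumerate(local_labels)]

plan = plan_synthetic_counts(local_labels, num_classes=4)
print(plan)  # {1: 45, 2: 50, 3: 50}

mixed = build_training_set(
    real_samples, plan, generate_fn=lambda c, i: f"synthetic_class{c}_{i}"
)
print(Counter(label for _, label in mixed))  # every class now has 50 samples
```

Both steps run entirely on the client, which mirrors the summary's point that no raw data leaves the device. The balanced local set would then be used for ordinary local training before the usual federated aggregation.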