A review on different techniques used to combat the non-IID and heterogeneous nature of data in FL

Venkataraman Natarajan Iyer
2024-01-02
Abstract: Federated Learning (FL) is a machine-learning approach that enables collaborative model training across multiple decentralized edge devices holding local data samples, without exchanging those samples. Training is coordinated either by a central server or over a peer-to-peer network. FL is particularly significant in industries such as healthcare and finance, where data privacy is paramount. However, training a model in the federated learning setting raises several challenges, one of the most prominent being the heterogeneity of data distribution among the edge devices. The data is typically not independent and identically distributed (non-IID), which hinders model convergence. This report examines the issues arising from non-IID and heterogeneous data and explores current algorithms designed to address these challenges.
Machine Learning
What problem does this paper attempt to address?
This paper examines the challenges that non-IID (not independent and identically distributed) and heterogeneous data pose in Federated Learning (FL), and surveys several techniques designed to address them.

### Main Issues

The core issues the paper attempts to address are:

- **Heterogeneous Data**: In federated learning, data from different devices usually follow different distributions, which makes it difficult to train a single global model.
- **Non-IID Data**: Data samples are not independent and identically distributed, so the data distribution can differ from device to device, which can slow model convergence or even cause divergence.

### Specific Challenges

- **Model Heterogeneity**: Different clients have different data distributions, making it difficult for the trained global model to perform well on all clients.
- **Convergence Challenges**: Heterogeneous and non-IID data may slow model convergence or prevent it altogether.
- **Sampling Bias**: Non-IID data may bias the model towards specific subgroups, so sampling bias must be addressed to ensure fairness and generalization.
- **Adaptability Issues**: Client data changes over time, and ensuring that the global model adapts quickly to local changes without degrading overall performance is a challenge.
- **Robustness**: Building models that generalize across different data sources is a key challenge in federated learning.

### Solutions

The paper introduces several methods to address the above challenges:

1. **FedDF (Federated Distillation Fusion)**: Uses knowledge distillation to fuse the knowledge of multiple client models into a global model, improving model accuracy and convergence speed.
2. **FedLbl (Label-based Aggregation Method)**: Aggregates local models based on the number of categories in each client's data to better handle heterogeneous data.
3. **Def-KT (Decentralized Mutual Learning Algorithm)**: In a decentralized federated learning setup, trains models through mutual knowledge transfer, improving generalization and the ability to learn from unseen data samples.

### Conclusion

The paper summarizes the importance of handling heterogeneous and non-IID data in federated learning and presents several effective solutions. Future research directions include further optimizing model aggregation techniques, dynamic adaptation methods, and handling sparse and imbalanced data.
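As a rough illustration of the label-based aggregation idea described for FedLbl, the sketch below compares two ways of weighting client models when averaging them: by sample count (as in standard federated averaging) and by the number of distinct labels each client holds. This is not the paper's implementation. The function names, the flat-list model representation, and the choice of making each weight directly proportional to the distinct-label count are all assumptions made for illustration.

```python
from typing import List, Sequence


def aggregation_weights(scores: Sequence[float]) -> List[float]:
    """Normalize non-negative per-client scores so that they sum to 1."""
    total = float(sum(scores))
    if total <= 0:
        raise ValueError("at least one client needs a positive score")
    return [s / total for s in scores]


def weighted_average(models: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Average flat parameter vectors coordinate-wise using the given weights."""
    dim = len(models[0])
    if any(len(m) != dim for m in models):
        raise ValueError("all client models must have the same number of parameters")
    return [sum(w * m[i] for w, m in zip(weights, models)) for i in range(dim)]


def sample_count_scores(client_labels: Sequence[Sequence[int]]) -> List[int]:
    """Score each client by how many samples it holds (FedAvg-style weighting)."""
    return [len(labels) for labels in client_labels]


def label_count_scores(client_labels: Sequence[Sequence[int]]) -> List[int]:
    """Score each client by how many distinct classes it holds.

    Assumption: this proportional rule is a hypothetical stand-in for the
    label-based weighting summarized above, not the published formula.
    """
    return [len(set(labels)) for labels in client_labels]


# Toy non-IID setup: client 0 has many samples of a single class,
# client 1 has few samples but covers four classes.
client_labels = [[0] * 8, [0, 1, 2, 3]]
client_models = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical flattened parameters

by_samples = weighted_average(
    client_models, aggregation_weights(sample_count_scores(client_labels)))
by_labels = weighted_average(
    client_models, aggregation_weights(label_count_scores(client_labels)))

print([round(x, 3) for x in by_samples])  # [0.667, 0.333]
print([round(x, 3) for x in by_labels])   # [0.2, 0.8]
```

In this toy case, sample-count weighting lets the single-class client dominate the global model, whereas label-count weighting shifts influence towards the client that covers more classes. That shift is the intuition behind aggregating by category count under label skew.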