On the Impact of Data Heterogeneity in Federated Learning Environments with Application to Healthcare Networks

Usevalad Milasheuski, Luca Barbieri, Bernardo Camajori Tedeschini, Monica Nicoli, Stefano Savazzi
2024-09-05
Abstract: Federated Learning (FL) allows multiple privacy-sensitive applications to leverage their dataset for a global model construction without any disclosure of the information. One of those domains is healthcare, where groups of silos collaborate in order to generate a global predictor with improved accuracy and generalization. However, the inherent challenge lies in the high heterogeneity of medical data, necessitating sophisticated techniques for assessment and compensation. This paper presents a comprehensive exploration of the mathematical formalization and taxonomy of heterogeneity within FL environments, focusing on the intricacies of medical data. In particular, we address the evaluation and comparison of the most popular FL algorithms with respect to their ability to cope with quantity-based, feature and label distribution-based heterogeneity. The goal is to provide a quantitative evaluation of the impact of data heterogeneity in FL systems for healthcare networks as well as a guideline on FL algorithm selection. Our research extends beyond existing studies by benchmarking seven of the most common FL algorithms against the unique challenges posed by medical data use cases. The paper targets the prediction of the risk of stroke recurrence through a set of tabular clinical reports collected by different federated hospital silos: data heterogeneity frequently encountered in this scenario and its impact on FL performance are discussed.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address the issue of the impact of data heterogeneity on model performance in a Federated Learning (FL) environment, particularly within healthcare networks. Specifically, the paper focuses on how to evaluate and compare the ability of different FL algorithms to handle heterogeneity in quantity, feature distribution, and label distribution to improve model accuracy and generalization capability.

### Background and Problem Description

- **Federated Learning (FL)**: FL allows multiple privacy-sensitive applications to build a global model using their datasets without disclosing information. The healthcare field is a significant application area for FL, where data silos (such as hospitals) collaborate to generate a global predictor with higher accuracy and generalization capability.
- **Data Heterogeneity**: High heterogeneity in medical data is one of the main challenges faced by FL. Data collected by different hospitals may have significant differences in patient demographics, medical equipment, and clinical practices, leading to non-independent and identically distributed (non-IID) data distributions.

### Objectives of the Paper

- **Mathematical Formalization and Classification**: The paper provides a comprehensive mathematical formalization and classification of data heterogeneity in the FL environment, particularly addressing the complexity of medical data.
- **Algorithm Evaluation and Comparison**: Evaluates and compares the performance of 7 of the most popular FL algorithms in handling heterogeneity in quantity, feature distribution, and label distribution.
- **Quantitative Evaluation and Guidelines**: Offers a method for quantitatively evaluating the impact of data heterogeneity on FL systems and provides guidelines for selecting appropriate FL algorithms.

### Main Contributions

1. **Classification of Heterogeneity Types**: Discusses the main types of heterogeneity in medical tabular data, including label distribution skew, quantity skew, and feature distribution skew.
2. **Data Heterogeneity Simulation**: Proposes methods for simulating data heterogeneity in the FL environment and designs a real-time FL network system based on the MQTT protocol.
3. **Algorithm Benchmarking**: Benchmarks 7 state-of-the-art FL algorithms under different heterogeneity settings to validate their sensitivity in handling data heterogeneity.
4. **Case Study**: Uses a publicly available stroke recurrence risk prediction dataset to explore the impact of data heterogeneity on FL performance and provides insights into algorithm selection.

### Conclusion

Through detailed experimental analysis, the paper demonstrates the strengths and weaknesses of different FL algorithms in handling medical data heterogeneity, providing important references for FL applications in the healthcare field. Specifically, SCAFFOLD and FedDyn perform well in handling feature imbalance, while FedProx, FedNova, and FedAvg are more suitable when device bandwidth or computational capacity is limited. Overall, selecting an appropriate FL algorithm requires a preliminary assessment based on the specific level of heterogeneity.
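
### Illustrative Sketch: Simulating Label Distribution Skew

The summary does not say how the paper simulates heterogeneity, so the following is only a minimal sketch of one technique that is common in the FL literature: Dirichlet partitioning of labels across clients. The function names (`dirichlet`, `label_skew_partition`), the concentration parameter `alpha`, and the toy 70/30 binary labels are all assumptions made for illustration and are not taken from the paper. A smaller `alpha` concentrates each class on fewer clients. Because the per-class shares differ between clients, the split typically produces some quantity skew as well.

```python
import random
from collections import Counter, defaultdict


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely underflow; fall back to uniform shares
        return [1.0 / k] * k
    return [d / total for d in draws]


def label_skew_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients with Dirichlet-distributed label skew.

    Hypothetical helper: smaller alpha gives stronger skew, while a large
    alpha approaches an IID split.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        shares = dirichlet(alpha, num_clients, rng)

        # Convert the shares into consecutive slices of this class's indices.
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += shares[c]
            if c == num_clients - 1:
                end = len(idxs)  # last client takes whatever remains
            else:
                end = round(cumulative * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


if __name__ == "__main__":
    # Toy binary labels (e.g. recurrence vs. no recurrence); values are made up.
    toy_labels = [0] * 700 + [1] * 300
    parts = label_skew_partition(toy_labels, num_clients=4, alpha=0.5, seed=1)
    for client_id, part in enumerate(parts):
        counts = Counter(toy_labels[i] for i in part)
        print(client_id, len(part), dict(counts))
```

Running the script prints, for each simulated client, its sample count and per-class counts. Repeating the run with different `alpha` values gives a quick way to see how the degree of skew changes before comparing FL algorithms on the resulting splits.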