A Prototype-Assisted Clustered Federated Learning for Big Data Security and Privacy Preservation

Yalan Jiang, Dan Wang, Bin Song, Xiaojiang Du
DOI: https://doi.org/10.1016/j.future.2024.07.032
2024-01-01
Abstract: In the rapidly expanding field of IoT, data is being produced at an unprecedented scale, providing valuable insights that accelerate decision-making. However, ensuring the privacy and security of this massive volume of data poses significant challenges. In this paper, we propose clustered federated learning (CFL) as a solution: clients upload only model weights while their data remains stored locally, which preserves both the security and privacy of big data. Nevertheless, applying CFL to big data raises practical challenges: (1) the participating FL clients are unlikely to have identical data distributions; (2) insufficient attention is given to the similarity between different clusters; and (3) CFL tends to ignore the class imbalance (i.e., long-tailed) problem, which hinders its application to big data and degrades the quality of target tasks. To address these issues and enable widespread CFL deployment in big data applications, this paper proposes a prototype-assisted clustered federated learning framework (MDSPFL). It relaxes the assumption that each client's data follows a single distribution, allowing the client’s local dataset to follow multiple source distributions while accounting for class imbalance, which better matches clients in a big data environment. Specifically, MDSPFL employs a proximal update mechanism to handle the workload surges caused by mixed distributions and the unavailability of similarity information between cluster models. Additionally, MDSPFL introduces a class-balanced local training mechanism to resolve the long-tailed problem, which uses contrastive learning and class prototypes to enforce a uniform distribution of all classes in the feature space. We conduct extensive experiments on several datasets (EMNIST, CIFAR-10, CIFAR-100), and the experimental results demonstrate the effectiveness of the proposed MDSPFL in big data scenarios with class-imbalanced, mixed-distribution clients.
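The abstract names two building blocks, class prototypes and a proximal update, without giving their form. The sketch below illustrates both under common assumptions: a class prototype is taken to be the mean feature vector of a class, and the proximal term is the FedProx-style penalty (mu / 2) * ||w_local - w_ref||^2. The function names, the coefficient `mu`, and both formulas are illustrative assumptions, not the definitions used by MDSPFL.

```python
from collections import defaultdict


def class_prototypes(features, labels):
    """Return {label: mean feature vector}.

    Assumption: a prototype is the per-class mean of feature vectors;
    the paper may define prototypes differently.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate the feature vector into the running sum for its class.
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def proximal_penalty(local_weights, reference_weights, mu):
    """FedProx-style proximal term: (mu / 2) * ||w_local - w_ref||^2.

    Assumption: weights are flat lists of floats and `mu` is a
    hypothetical regularization coefficient.
    """
    squared_distance = sum(
        (w - r) ** 2 for w, r in zip(local_weights, reference_weights)
    )
    return 0.5 * mu * squared_distance


if __name__ == "__main__":
    feats = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]]
    labs = [0, 0, 1]
    print(class_prototypes(feats, labs))
    print(proximal_penalty([1.0, 2.0], [0.0, 0.0], mu=0.5))
```

In a training loop of this kind, the penalty would typically be added to the local loss so that a client's weights stay close to a reference (for example, cluster) model, while the prototypes would serve as anchors for a contrastive objective.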