Abstract: Federated learning is a decentralized learning paradigm wherein a central server trains a global model iteratively by utilizing clients who possess a certain amount of private datasets. The challenge lies in the fact that the client-side private data may not be independently and identically distributed, significantly impacting the accuracy of the global model. Existing methods commonly address the Non-IID challenge by focusing on optimization, client selection and data complement. However, most approaches tend to overlook the perspective of the private data itself due to privacy constraints. Intuitively, statistical distinctions among private data on the client side can help mitigate the Non-IID degree. Besides, the recent advancements in dataset condensation technology have inspired us to investigate its potential applicability in addressing Non-IID issues while maintaining privacy. Motivated by this, we propose DCFL which divides clients into groups by using the Centered Kernel Alignment (CKA) method, then uses dataset condensation methods with non-IID awareness to complete clients. The private data from clients within the same group is complementary and their condensed data is accessible to all clients in the group. Additionally, CKA-guided client selection strategy, filtering mechanisms, and data enhancement techniques are incorporated to efficiently and precisely utilize the condensed data, enhance model performance, and minimize communication time. Experimental results demonstrate that DCFL achieves competitive performance on popular federated learning benchmarks including MNIST, FashionMNIST, SVHN, and CIFAR-10 with existing FL protocol.
Machine Learning; Artificial Intelligence; Cryptography and Security; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the problem of non-independent and identically distributed (Non-IID) data in federated learning (FL). Specifically, the private data held by clients may not be independent and identically distributed, which significantly affects the accuracy of the global model. Existing methods usually deal with the Non-IID challenge through optimization, client selection, and data supplementation, but they often overlook the perspective of the private data itself because of privacy limitations.
To explain this problem in more detail, the impact of Non-IID data can be quantified by the following weight divergence measure:
\[
\text{weight divergence} = \frac{\|w_{\text{FedAvg}} - w_{\text{SGD}}\|}{\|w_{\text{SGD}}\|}
\]
where \( w_{\text{FedAvg}} \) denotes the weights trained by the federated averaging algorithm (FedAvg), and \( w_{\text{SGD}} \) denotes the weights trained by SGD on the global dataset (assuming the server knows all data distributions). Non-IID data increases this weight divergence, which in turn degrades model performance.
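As a minimal sketch of the weight divergence formula above, the snippet below computes the relative L2 distance between two weight vectors. It assumes each model's weights have already been flattened into a plain list of floats; the function name `weight_divergence` and the toy values are illustrative and do not come from the paper.

```python
import math


def weight_divergence(w_fedavg, w_sgd):
    """Return ||w_fedavg - w_sgd|| / ||w_sgd|| using the L2 norm.

    Both arguments are flattened weight vectors (lists of floats).
    This flattening is an assumption made for illustration.
    """
    if len(w_fedavg) != len(w_sgd):
        raise ValueError("weight vectors must have the same length")
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_fedavg, w_sgd)))
    ref_norm = math.sqrt(sum(b * b for b in w_sgd))
    if ref_norm == 0.0:
        raise ValueError("reference weights have zero norm")
    return diff_norm / ref_norm


# Toy example: the difference has norm 3 and the reference has norm 5.
w_sgd = [3.0, 4.0]
w_fedavg = [3.0, 1.0]
print(weight_divergence(w_fedavg, w_sgd))  # 0.6
```

A value of 0 means the two weight vectors coincide; larger values indicate that the federated model has drifted further from the centrally trained reference.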
Furthermore, the paper points out that although existing methods perform well in some Non-IID scenarios, they cannot consistently outperform other algorithms and cannot change the inherent Non-IID characteristics of client data. Therefore, the authors propose a new framework, DCFL (Data Condensation aided Federated Learning with Non-IID awareness), which aims to mitigate the negative impact of Non-IID data on federated learning model training, communication, and performance by using condensed data efficiently.
### Main contributions
1. **Client complementarity based on CKA**: The Centered Kernel Alignment (CKA) method is introduced to measure the complementarity between clients, guiding both client selection and condensed data transmission. The server calculates the complementarity between each client and every other client, then groups the clients accordingly. This enables more fine-grained client selection, reduces the overall communication cost, and improves the final model performance.
2. **Condensed-data-assisted client model training with Non-IID awareness**: When a client model is trained, its real data is combined with the condensed data from other clients in the same complementary group. In addition, the DSA (Differentiable Siamese Augmentation) data augmentation technique is used, and the weighting formula for participating clients is reorganized to account for the change in the size of each client's local dataset. Together, these steps further reduce the number of communication rounds, stabilize the training process, and ultimately improve model performance.
3. **Experimental verification**: Four public datasets, MNIST, FashionMNIST, SVHN, and CIFAR-10, are used to verify the effectiveness of the DCFL algorithm. The experimental results show that DCFL outperforms traditional federated learning methods in terms of test accuracy and communication cost in different scenarios.
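To make the CKA-based comparison in the first contribution more concrete, here is a minimal sketch of linear CKA between two feature matrices with one row per sample. The summary does not state which representations DCFL feeds into CKA, which kernel it uses, or how scores are turned into groups, so the linear kernel, the input format, and all function names below are assumptions made for illustration. The code uses only the standard library.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F).

    x and y are lists of rows with the same number of rows (samples);
    the number of columns (features) may differ.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(yc, xc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denominator == 0.0:
        raise ValueError("CKA is undefined for constant features")
    return numerator / denominator


# Hypothetical features for the same four samples from three clients.
client_a = [[1.0], [-1.0], [0.0], [0.0]]
client_b = [[2.0], [-2.0], [0.0], [0.0]]   # client_a scaled by 2
client_c = [[0.0], [0.0], [1.0], [-1.0]]   # uncorrelated with client_a

print(linear_cka(client_a, client_b))  # scaled copy: similarity of 1
print(linear_cka(client_a, client_c))  # uncorrelated: similarity of 0
```

Linear CKA lies in [0, 1] and is invariant to isotropic scaling of either input. How DCFL maps such similarity scores to complementarity and to client groups is not described in this summary, so the sketch stops at the pairwise score.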
In summary, the main goal of this paper is to deal effectively with the Non-IID data problem in federated learning, improving model performance and reducing communication overhead by combining data condensation techniques with CKA-guided client selection strategies.