Secure and Fault Tolerant Decentralized Learning

Saurav Prakash, Hanieh Hashemi, Yongqin Wang, Murali Annavaram, Salman Avestimehr
DOI: https://doi.org/10.48550/arXiv.2010.07541
2022-09-13
Abstract: Federated learning (FL) is a promising paradigm for training a global model over data distributed across multiple data owners without centralizing clients' raw data. However, sharing of local model updates can also reveal information about clients' local datasets. Trusted execution environments (TEEs) within the FL server have been recently deployed by companies like Meta for secure aggregation. However, secure aggregation can suffer from error-prone local updates sent by clients that become faulty during training due to underlying device malfunctions. Also, data heterogeneity across clients makes fault mitigation challenging, as even updates from normal clients are dissimilar. Thus, most of the prior fault tolerant methods, which treat any local update differing from the majority of other updates as faulty, perform poorly. We propose DiverseFL to make model aggregation secure as well as robust to faults. In DiverseFL, any client whose local model update diverges from its associated guiding update is tagged as being faulty. To implement our novel per-client criteria for fault mitigation, DiverseFL creates a TEE-based secure enclave within the FL server, which in addition to performing secure aggregation for carrying out the global model update step, securely receives a small representative sample of local data from each client only once before training, and computes guiding updates for each participating client during training. Thus, DiverseFL provides security against privacy leakage as well as robustness against faulty clients. In experiments, DiverseFL consistently achieves significant improvements in absolute test accuracy over prior fault mitigation benchmarks. DiverseFL also performs closely to OracleSGD, where the server combines updates only from the normal clients. We also analyze the convergence rate of DiverseFL under non-IID data and standard convexity assumptions.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper aims to achieve data privacy protection and robustness against faulty nodes in federated learning (FL) at the same time. Specifically, it asks how secure aggregation can protect the data privacy of participating clients when their data is non-IID (not independently and identically distributed), while faulty updates caused by hardware or software failures are still detected and excluded so that training performance and accuracy are preserved.

### Background and Challenges

- **Privacy Protection**: In federated learning, clients train models locally on their own data and send model updates to a central server, which uses them to update the global model. Sharing these local updates may leak information about clients' data, so secure aggregation is needed to protect data privacy.
- **Fault Tolerance**: In practice, clients may send incorrect model updates for various reasons, such as hardware failures or software errors. These abnormal updates can seriously degrade model training. With non-IID data in particular, the updates of normal clients may themselves vary greatly, so traditional similarity-based fault detection methods perform poorly.

### Solutions

The paper proposes a new method named DiverseFL, which combines a trusted execution environment (TEE) with client-specific fault detection criteria to achieve the following goals:

- **Secure Aggregation**: Use the isolated execution environment provided by a TEE (such as Intel SGX) to ensure that clients' model updates are not leaked during transmission and aggregation.
- **Fault Detection**: Each client provides a small representative data sample to the TEE once, before training. In each training round, the TEE computes a "guiding update" for each client from that client's sample and compares it with the update the client actually uploads. Two indicators decide whether the client is faulty: direction similarity and length similarity.
- **Robustness**: With these checks, DiverseFL can identify and exclude the updates of faulty nodes, improving the robustness of model training and the final test accuracy.

### Experimental Results

- **Performance Improvement**: On multiple benchmark datasets, the test accuracy of DiverseFL is significantly higher than that of existing fault-tolerance methods, with an absolute improvement of up to about 39%.
- **Close to OracleSGD**: The performance of DiverseFL is close to that of OracleSGD (in which the server aggregates updates only from normal clients), indicating that its fault detection mechanism is very effective.
- **Scalability**: The experiments also examine the scalability of DiverseFL. A single TEE can support up to 316 clients without causing significant latency.

### Conclusion

DiverseFL addresses both privacy protection and fault tolerance in federated learning by combining the security features of a TEE with client-specific fault detection criteria, and it performs especially well with non-IID data. The method supports the reliability and security of federated learning in practical applications.
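
The per-client check described under Solutions can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the threshold values `min_ratio` and `max_ratio`, the use of a positive dot product as the direction test, and the handling of a zero-length guiding update are all assumptions made for illustration. It only shows the general idea of comparing each client's update with its own guiding update, rather than with the other clients' updates, and averaging the updates that pass.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def is_faulty(client_update, guiding_update, min_ratio=0.5, max_ratio=2.0):
    """Flag a client whose update diverges from its guiding update.

    min_ratio and max_ratio are hypothetical thresholds, not values
    taken from the paper.
    """
    guide_norm = norm(guiding_update)
    if guide_norm == 0.0:
        # Assumption: with no usable reference, exclude the client.
        return True
    # Direction similarity: the two updates should point the same way.
    direction_ok = dot(client_update, guiding_update) > 0.0
    # Length similarity: the two norms should be of comparable size.
    ratio = norm(client_update) / guide_norm
    length_ok = min_ratio <= ratio <= max_ratio
    return not (direction_ok and length_ok)


def aggregate(client_updates, guiding_updates):
    """Average the updates of clients that pass both checks."""
    kept = [u for u, g in zip(client_updates, guiding_updates)
            if not is_faulty(u, g)]
    if not kept:
        return None  # every client was flagged in this round
    dim = len(kept[0])
    return [sum(u[i] for u in kept) / len(kept) for i in range(dim)]


# Two normal clients with dissimilar (non-IID) updates and one faulty client
# whose update points against its guiding update.
updates = [[1.0, 0.0], [0.0, 1.0], [-5.0, 0.0]]
guides = [[0.8, 0.1], [0.1, 0.9], [1.0, 0.0]]
print(aggregate(updates, guides))  # [0.5, 0.5]
```

In this toy round the first two clients disagree strongly with each other, yet both are kept because each agrees with its own guiding update. A majority-similarity rule would have no clear majority to compare against here.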