Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models (Extended Version)

Aydin Abadi, Vishnu Asutosh Dasu, Sumanta Sarkar
2024-12-05
Abstract: Deduplication is a vital preprocessing step that enhances machine learning model performance and saves training time and energy. However, enhancing federated learning through deduplication poses challenges, especially regarding scalability and potential privacy violations if deduplication involves sharing all clients' data. In this paper, we address the problem of deduplication in a federated setup by introducing a pioneering protocol, Efficient Privacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes duplicates from multiple clients' datasets without compromising data privacy. EP-MPD is constructed in a modular fashion, utilizing two novel variants of the Private Set Intersection protocol. Our extensive experiments demonstrate the significant benefits of deduplication in federated learning of large language models. For instance, we observe up to 19.62% improvement in perplexity and up to 27.95% reduction in running time while varying the duplication level between 10% and 30%. EP-MPD effectively balances privacy and performance in federated learning, making it a valuable solution for large-scale applications.
Cryptography and Security, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is how to remove duplicates across multiple clients' datasets in a Federated Learning (FL) setting without violating data privacy. Specifically:

1. **Importance of Deduplication**:
   - Deduplication is an important preprocessing step before machine learning model training; it can improve model performance and save training time and energy.
   - For Language Models (LMs), duplicate data sequences can cause the model to overfit, harm generalization, and increase the risk of privacy leakage.
2. **Existing Challenges**:
   - Introducing deduplication in Federated Learning faces two major challenges: scalability and potential privacy violations.
   - If the deduplication process involves sharing all clients' data, it may lead to privacy leakage.
3. **Solutions**:
   - The authors propose a new protocol, Efficient Privacy-Preserving Multi-Party Deduplication (EP-MPD), which efficiently removes duplicates from multiple clients' datasets without compromising data privacy.
   - EP-MPD is built on two new variants of the Private Set Intersection (PSI) protocol, balancing privacy and performance.
4. **Experimental Verification**:
   - Experiments show that deduplication with EP-MPD significantly benefits federated learning of large language models. For example, as the duplication level varies between 10% and 30%, perplexity improves by up to 19.62% and running time is reduced by up to 27.95%.

With these improvements, EP-MPD provides a valuable solution for Federated Learning in large-scale applications, improving model performance while protecting data privacy.

### Involved Formulas

- **Perplexity (PP)**: \[ PP(Y)=\exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log(\Theta(y_i|y_1,\ldots,y_{i-1}))\right) \] where \(Y=\{y_1,y_2,\ldots,y_n\}\) is a sequence containing \(n\) tokens and \(\Theta\) is a language model.
- **Negative Log-Likelihood Loss**: \[ L(\Theta,Y)=-\sum_{i=1}^{n}\log(\Theta(y_i|y_1,\ldots,y_{i-1})) \]
- **Global Model Update**: \[ \Theta=\frac{1}{d}\sum_{i=1}^{n}d_i\theta_i \] where, in this formula, \(n\) is the number of clients, \(d=\sum_{i=1}^{n}d_i\) is the total size of the dataset \(S=\cup_{i=1}^{n}S_i\), and \(\theta_i\) is the local model trained by client \(D_i\) on its local dataset \(S_i\).

Through these formulas and methods, the paper shows how to improve the training efficiency and performance of language models in Federated Learning while preserving privacy.
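The three formulas above can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes each token's conditional probability \(\Theta(y_i|y_1,\ldots,y_{i-1})\) is already available as a plain float, and that each local model \(\theta_i\) is a flat list of parameters. The function names `negative_log_likelihood`, `perplexity`, and `aggregate` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """L(Theta, Y): negative sum of log-probabilities of the tokens in Y."""
    return -sum(math.log(p) for p in token_probs)


def perplexity(token_probs: Sequence[float]) -> float:
    """PP(Y) = exp(L(Theta, Y) / n), where n is the number of tokens."""
    n = len(token_probs)
    return math.exp(negative_log_likelihood(token_probs) / n)


def aggregate(local_models: Sequence[Sequence[float]],
              sizes: Sequence[int]) -> List[float]:
    """Global update: average of local models weighted by dataset size.

    local_models[i] is theta_i (assumed here to be a flat parameter list);
    sizes[i] is d_i, the weight of client i's local dataset.
    """
    d = sum(sizes)  # total size d = sum of d_i
    dim = len(local_models[0])
    return [
        sum(d_i * theta[k] for theta, d_i in zip(local_models, sizes)) / d
        for k in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical probabilities a model assigns to a 4-token sequence.
    probs = [0.25, 0.25, 0.25, 0.25]
    print(perplexity(probs))

    # Two hypothetical clients with 1 and 3 local samples respectively.
    print(aggregate([[1.0, 2.0], [3.0, 4.0]], sizes=[1, 3]))
```

Written this way, the sketch shows that perplexity is the exponential of the per-token negative log-likelihood: a model that assigns probability 0.25 to every token has a perplexity of about 4. In the aggregation step, a client's weight \(d_i\) scales with its dataset size, so removing duplicates from a client's data also changes how much that client contributes to the global model.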