Abstract: Deduplication is a vital preprocessing step that enhances machine learning model performance and saves training time and energy. However, enhancing federated learning through deduplication poses challenges, especially regarding scalability and potential privacy violations if deduplication involves sharing all clients' data. In this paper, we address the problem of deduplication in a federated setup by introducing a pioneering protocol, Efficient Privacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes duplicates from multiple clients' datasets without compromising data privacy. EP-MPD is constructed in a modular fashion, utilizing two novel variants of the Private Set Intersection protocol. Our extensive experiments demonstrate the significant benefits of deduplication in federated learning of large language models. For instance, we observe up to 19.62% improvement in perplexity and up to 27.95% reduction in running time while varying the duplication level between 10% and 30%. EP-MPD effectively balances privacy and performance in federated learning, making it a valuable solution for large-scale applications.

What problem does this paper attempt to address?

The problem this paper addresses is how to remove duplicates across multiple clients' datasets in a Federated Learning (FL) setting without violating data privacy. Specifically:

1. **Importance of Deduplication**:
   - Deduplication is an important preprocessing step before training a machine learning model; it can improve model performance and save training time and energy.
   - For Language Models (LMs), duplicate data sequences can cause the model to overfit, hurt its generalization ability, and increase the risk of privacy leakage.
2. **Existing Challenges**:
   - Introducing deduplication into Federated Learning faces two major challenges: scalability and potential privacy violations.
   - If the deduplication process involves sharing all clients' data, it may lead to privacy leakage.
3. **Solutions**:
   - The authors propose a new protocol, Efficient Privacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes duplicates from multiple clients' datasets without compromising data privacy.
   - EP-MPD is built in a modular fashion from two new variants of the Private Set Intersection (PSI) protocol, balancing privacy and performance.
4. **Experimental Verification**:
   - The experiments show significant benefits of deduplication in federated learning of large language models. For example, as the duplication level varies between 10% and 30%, perplexity improves by up to 19.62% and running time is reduced by up to 27.95%.

With these improvements, EP-MPD offers a valuable solution for Federated Learning in large-scale applications: it improves model performance while protecting data privacy.

### Involved Formulas

- **Perplexity (PP)**:
  \[ PP(Y)=\exp\left(-\frac{1}{n}\sum_{i = 1}^{n}\log(\Theta(y_i|y_1,\ldots,y_{i - 1}))\right) \]
  where \(Y = \{y_1,y_2,\ldots,y_n\}\) is a sequence containing \(n\) tokens and \(\Theta\) is a language model.
- **Negative Log-Likelihood Loss**:
  \[ L(\Theta,Y)=-\sum_{i = 1}^{n}\log(\Theta(y_i|y_1,\ldots,y_{i - 1})) \]
- **Global Model Update**:
  \[ \Theta=\frac{1}{d}\sum_{i = 1}^{n}d_i\theta_i \]
  where \(d=\sum_{i = 1}^{n}d_i\) is the total size of the dataset \(S=\cup_{i = 1}^{n}S_i\), and \(\theta_i\) is the local model trained by client \(D_i\) on its local dataset \(S_i\). In this formula \(n\) counts the clients, not the tokens of a sequence as in the two formulas above.

Together, these formulas and the protocol show how the paper improves the training efficiency and performance of language models in Federated Learning while preserving privacy.
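The three formulas above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation. The function names (`negative_log_likelihood`, `perplexity`, `aggregate`) are hypothetical. The sketch also assumes that the per-token probabilities \(\Theta(y_i|y_1,\ldots,y_{i-1})\) are already available as a list of floats, and that each local model \(\theta_i\) is a flat list of parameters.

```python
import math
from typing import List, Sequence


def negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """L(Theta, Y) = -sum_i log Theta(y_i | y_1, ..., y_{i-1}).

    token_probs[i] is assumed to hold the model's probability for token i
    given the preceding tokens.
    """
    return -sum(math.log(p) for p in token_probs)


def perplexity(token_probs: Sequence[float]) -> float:
    """PP(Y) = exp(L(Theta, Y) / n), with n the number of tokens."""
    return math.exp(negative_log_likelihood(token_probs) / len(token_probs))


def aggregate(local_models: Sequence[Sequence[float]],
              sizes: Sequence[int]) -> List[float]:
    """Theta = (1/d) * sum_i d_i * theta_i, with d = sum_i d_i.

    local_models[i] is client i's parameter vector theta_i and
    sizes[i] is d_i, the size of that client's local dataset.
    """
    d = sum(sizes)
    dim = len(local_models[0])
    return [
        sum(d_i * theta[k] for theta, d_i in zip(local_models, sizes)) / d
        for k in range(dim)
    ]


if __name__ == "__main__":
    probs = [0.25, 0.5, 0.125, 0.5]  # made-up per-token probabilities
    print(negative_log_likelihood(probs))
    print(perplexity(probs))
    # Two clients holding 1 and 3 samples: the larger client dominates.
    print(aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Written this way, perplexity is simply the exponential of the per-token average of the negative log-likelihood loss. The update weights each client by its dataset size, so a client that still holds many duplicates contributes a larger \(d_i\) to the average.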

Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models (Extended Version)
