Abstract: This position paper investigates the integration of Differential Privacy (DP) in the training of Mixture of Experts (MoE) models within the field of natural language processing. As Large Language Models (LLMs) scale to billions of parameters, leveraging expansive datasets, they exhibit enhanced linguistic capabilities and emergent abilities. However, this growth raises significant computational and privacy concerns. Our study addresses these issues by exploring the potential of MoE models, known for their computational efficiency, and the application of DP, a standard for privacy preservation. We present the first known attempt to train MoE models under the constraints of DP, addressing the unique challenges posed by their architecture and the complexities of DP integration. Our initial experimental studies demonstrate that MoE models can be effectively trained with DP, achieving performance that is competitive with their non-private counterparts. This initial study aims to provide valuable insights and ignite further research in the domain of privacy-preserving MoE models, softly laying the groundwork for prospective developments in this evolving field.

What problem does this paper attempt to address?

The main problem this paper addresses is how to apply Differential Privacy (DP) to the training of Mixture of Experts (MoE) models, particularly in natural language processing. As Large Language Models (LLMs) grow in parameter count and rely on ever larger datasets, they show stronger language-processing and emergent capabilities, but they also incur significant computational costs and raise privacy concerns. Specifically, the research focuses on the following aspects:

1. **Computational Efficiency and Privacy Protection**:
   - Training large-scale language models requires substantial computational resources. The MoE architecture allocates computation more efficiently, which reduces computational cost.
   - Differential privacy is a rigorous privacy standard: it limits how much any individual record can influence the trained model, so that individual data is protected from leakage even when the training set contains sensitive data.
2. **Challenges in Integrating MoE Models and DP**:
   - The special architecture of MoE models (sparse gating mechanism, expert selection, and so on) complicates the introduction of differential privacy into training.
   - For example, implementing differential privacy requires computing the gradient of each individual sample, which poses new technical challenges for MoE models.
3. **Experimental Verification**:
   - The authors conducted preliminary experiments showing that MoE models can still be trained effectively under differential privacy, with performance equal or close to that of the non-private version.

Through this work, the authors hope to provide valuable insights for future research, promote the development of privacy-preserving MoE models, and provide theoretical and technical support for building larger-scale models.
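The per-sample gradient requirement mentioned above can be illustrated with a minimal sketch of one DP-SGD-style update: compute each example's gradient separately, clip it to a fixed norm, sum the clipped gradients, and add Gaussian noise scaled to the clipping norm. This is a sketch under stated assumptions, not the paper's implementation: the toy linear model, the squared-error loss, and the names `clip`, `per_sample_grads`, `dp_sgd_step`, `max_norm`, and `noise_multiplier` are all hypothetical, and no privacy accounting is performed.

```python
import math
import random


def clip(grad, max_norm):
    """Scale grad down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def per_sample_grads(w, xs, ys):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w, one per example."""
    grads = []
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([err * xi for xi in x])
    return grads


def dp_sgd_step(w, xs, ys, lr, max_norm, noise_multiplier, rng):
    """One noisy update: clip each example's gradient, sum, add noise, average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads(w, xs, ys)]
    summed = [sum(col) for col in zip(*clipped)]
    # Noise standard deviation is tied to the clipping norm (illustrative choice).
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    n = len(xs)
    return [wi - lr * g / n for wi, g in zip(w, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ys = [1.0, 2.0, 3.0]
    w = [0.0, 0.0]
    for _ in range(100):
        w = dp_sgd_step(w, xs, ys, lr=0.1, max_norm=1.0,
                        noise_multiplier=0.5, rng=rng)
    print(w)
```

Clipping bounds each example's contribution before noise is added, which is why the gradients must be available per example rather than only as a batch average.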
### Main Contributions

- **First Attempt**: This is the first known work to apply differential privacy to the training of MoE models.
- **Solving Technical Problems**: It identifies and addresses the main challenges that arise when combining MoE models with differential privacy, especially per-sample gradient calculation.
- **Experimental Verification**: Experiments show that MoE models can be trained with differential privacy while maintaining performance.

### Conclusions and Future Work

The authors point out that, although some progress has been made, many open questions and potential research directions remain, such as improving the load-balancing loss function and integrating differential privacy directly into the expert selection process. In addition, the experiments need to be extended to more types of datasets and tasks in order to better understand the trade-off between privacy and utility.
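To make the terms "expert selection" and "load-balancing loss" concrete, the following sketch shows top-k routing over gate logits together with one commonly used auxiliary loss (the Switch-Transformer-style form, number of experts times the sum over experts of the dispatched-token fraction times the mean gate probability). Whether the paper uses this exact loss is an assumption; the function names `softmax`, `top_k_route`, and `load_balancing_loss` are hypothetical. The loss is computed from statistics over the whole batch, so it does not decompose into independent per-example terms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Return (expert_index, renormalized_weight) for the k largest gate probabilities."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


def load_balancing_loss(batch_logits):
    """Auxiliary loss n_experts * sum_i f_i * P_i, using top-1 dispatch.

    f_i is the fraction of tokens dispatched to expert i and P_i is the
    mean gate probability of expert i; both are batch-level statistics.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    probs = [softmax(row) for row in batch_logits]
    frac = [0.0] * n_experts
    for p in probs:
        chosen = max(range(n_experts), key=lambda i: p[i])
        frac[chosen] += 1.0 / n_tokens
    mean_prob = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * q for f, q in zip(frac, mean_prob))


if __name__ == "__main__":
    print(top_k_route([2.0, 0.0, 1.0], k=2))
    print(load_balancing_loss([[1.0, 0.0], [0.0, 1.0]]))  # tokens spread across experts
    print(load_balancing_loss([[5.0, 0.0], [5.0, 0.0]]))  # all tokens on one expert
```

In this sketch the loss is smallest when tokens are spread evenly and grows as routing collapses onto one expert.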

Differentially Private Training of Mixture of Experts Models
