Differentially Private Language Models Benefit from Public Pre-training

Gavin Kerrigan, Dylan Slack, Jens Tuyls
DOI: https://doi.org/10.48550/arXiv.2009.05886
2020-10-27
Abstract: Language modeling is a keystone task in natural language processing. When training a language model on sensitive information, differential privacy (DP) allows us to quantify the degree to which our private data is protected. However, training algorithms which enforce differential privacy often lead to degradation in model quality. We study the feasibility of learning a language model which is simultaneously high-quality and privacy preserving by tuning a public base model on a private corpus. We find that DP fine-tuning boosts the performance of language models in the private domain, making the training of such models possible.
Machine Learning, Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the problem of protecting personal privacy when training language models on datasets that contain sensitive information. Specifically, it explores whether a language model that is both high-quality and privacy-preserving can be obtained by fine-tuning a publicly pre-trained language model on a private corpus. The authors find that fine-tuning with differential privacy (DP) on the private dataset improves the performance of the language model in the private domain while keeping the privacy of personal data protected during training.

The main challenges mentioned in the paper include:

- **Model quality degradation**: Traditional differentially private training methods often lead to a significant decline in model performance.
- **Privacy protection requirements**: When a dataset contains sensitive personal information, the privacy of individuals must be strictly protected, especially in sensitive fields such as medical care.
- **Data distribution differences**: Differences between the distributions of the public and private datasets may leave the model with insufficient generalization ability.

To address these problems, the authors propose first pre-training a non-private language model on a large-scale public dataset, and then fine-tuning this model on the private dataset with differentially private training. The experimental results show that this method improves the performance of the model while protecting the privacy of the data to a quantifiable extent.
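
The two-stage recipe described above can be illustrated with a minimal sketch. It assumes the private fine-tuning stage uses a DP-SGD style update (clip each per-example gradient, sum, add Gaussian noise), which this summary does not spell out. A tiny linear model with squared-error loss stands in for the language model, and every name here (`clip`, `dp_sgd_step`, `public_weights`, `private_batch`) is hypothetical. The sketch does no privacy accounting, so it does not by itself yield an (ε, δ) guarantee.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(weights, batch, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD style update for a linear model with squared-error loss."""
    summed = [0.0] * len(weights)
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        # Gradient of (pred - y)^2 for this single example.
        grad = [2.0 * (pred - y) * xi for xi in x]
        # Clipping bounds how much any one example can move the model.
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    # Gaussian noise scaled to the clipping bound, added to the summed gradient.
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / len(batch) for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


# Stage 1 (assumed already done): weights obtained by ordinary,
# non-private training on public data. Values are made up.
public_weights = [0.5, -0.2]

# Stage 2: fine-tune on a toy "private" batch with clipped, noised updates.
private_batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.5), ([1.0, 1.0], 1.5)]
rng = random.Random(0)
weights = public_weights
for _ in range(50):
    weights = dp_sgd_step(weights, private_batch, lr=0.05,
                          max_norm=1.0, noise_multiplier=0.5, rng=rng)
```

Starting from `public_weights` rather than a random initialization is the point of the approach: the noisy updates only need to adapt the model to the private domain rather than learn everything from scratch. In practice one would use a DP training library with a privacy accountant instead of a hand-rolled loop.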