Training Data Leakage Analysis in Language Models

Huseyin A. Inan, Osman Ramadan, Lukas Wutschitz, Daniel Jones, Victor Rühle, James Withers, Robert Sim
DOI: https://doi.org/10.48550/arXiv.2101.05405
2021-02-23
Abstract: Recent advances in neural network based language models lead to successful deployments of such models, improving user experience in various applications. It has been demonstrated that strong performance of language models comes along with the ability to memorize rare training samples, which poses serious privacy threats in case the model is trained on confidential user content. In this work, we introduce a methodology that investigates identifying the user content in the training data that could be leaked under a strong and realistic threat model. We propose two metrics to quantify user-level data leakage by measuring a model's ability to produce unique sentence fragments within training data. Our metrics further enable comparing different models trained on the same data in terms of privacy. We demonstrate our approach through extensive numerical studies on both RNN and Transformer based models. We further illustrate how the proposed metrics can be utilized to investigate the efficacy of mitigations like differentially private training or API hardening.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the risk of training data leakage in language models. When a language model is trained on data containing users' sensitive information, it may memorize rare training samples and later leak them, compromising user privacy. Leakage can occur when the model generates text, or when an attacker reconstructs training samples by probing the model. The paper proposes a methodology to identify and quantify this risk under a strict black-box assumption: the attacker can only access the model's top-k predictions for a given input prefix. With this methodology, the authors aim to compare the privacy of different models trained on the same data and to evaluate the effectiveness of mitigations such as differentially private training or API hardening.
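The black-box setting described above can be illustrated with a small sketch. It is not the paper's method or its metrics: the bigram counter is a toy stand-in for an RNN or Transformer, and the names `train_bigram_model`, `top_k_predictions`, and `fragment_is_recoverable` are hypothetical. It only shows the idea that an attacker who sees nothing but top-k predictions may still be able to walk through a rare training fragment token by token.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count next-token frequencies per previous token (toy stand-in for a language model)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def top_k_predictions(model, prefix_tokens, k):
    """The only interface the attacker is assumed to have: top-k next tokens for a prefix."""
    if not prefix_tokens:
        return []
    # The toy model conditions on the last token only; a real model would use the full prefix.
    next_counts = model.get(prefix_tokens[-1], Counter())
    return [token for token, _ in next_counts.most_common(k)]


def fragment_is_recoverable(model, fragment_tokens, context_len, k):
    """Return True if, given the first context_len tokens, every later token is in the top-k."""
    for i in range(context_len, len(fragment_tokens)):
        if fragment_tokens[i] not in top_k_predictions(model, fragment_tokens[:i], k):
            return False
    return True


# Made-up training data: three common sentences and one rare, sensitive one.
training_sentences = [
    "please send the report today",
    "please send the report tomorrow",
    "please send the invoice today",
    "my password is hunter2",
]
model = train_bigram_model(training_sentences)

secret = "my password is hunter2".split()
rarer = "please send the invoice today".split()

print(fragment_is_recoverable(model, secret, context_len=1, k=1))
print(fragment_is_recoverable(model, rarer, context_len=1, k=1))
print(fragment_is_recoverable(model, rarer, context_len=1, k=2))
```

In this toy setup, lowering `k` is a crude analogue of API hardening: the fewer predictions the interface exposes, the fewer fragments an attacker can follow to completion.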