Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration

Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang
2024-06-25
Abstract: Membership Inference Attacks (MIA) aim to infer whether a target data record has been utilized for model training or not. Prior attempts have quantified the privacy risks of language models (LMs) via MIAs, but there is still no consensus on whether existing MIA algorithms can cause remarkable privacy leakage on practical Large Language Models (LLMs). Existing MIAs designed for LMs can be classified into two categories: reference-free and reference-based attacks. They are both based on the hypothesis that training records consistently strike a higher probability of being sampled. Nevertheless, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs. The reference-based attack seems to achieve promising effectiveness in LLMs, which measures a more reliable membership signal by comparing the probability discrepancy between the target model and the reference model. However, the performance of reference-based attack is highly dependent on a reference dataset that closely resembles the training dataset, which is usually inaccessible in the practical scenario. Overall, existing MIAs are unable to effectively unveil privacy leakage over practical fine-tuned LLMs that are overfitting-free and private. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal, probabilistic variation, which is based on memorization rather than overfitting. Furthermore, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs.
Computation and Language, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the limitations of existing Membership Inference Attacks (MIAs) when they are applied to Large Language Models (LLMs). Specifically:

1. **Dependence on the overfitting assumption**: Existing MIA methods usually assume that training records have a higher sampling probability than non-training records. This assumption only holds when the target model overfits. However, regularization methods and the generalization ability of LLMs mitigate overfitting, so these attacks suffer a high false-positive rate in practice.
2. **Dependence on the reference dataset**: Reference-based attacks need a reference dataset whose distribution closely resembles that of the training dataset. In practical scenarios such a dataset is usually inaccessible, which significantly reduces the performance of reference-based MIAs.

To solve these problems, the authors propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). The main innovations of this method are:

- **Self-prompt method**: The adversary prompts the target LLM itself to generate text, and uses that text as a reference dataset with a distribution similar to the training dataset. The reference model is fine-tuned on this generated data, which removes the dependence on a high-quality reference dataset.
- **Probabilistic variation evaluation**: A new membership signal, probabilistic variation, is introduced. It is based on memorization in the LLM rather than on overfitting, and detects member records more reliably.

With these two modules, SPV-MIA significantly improves MIA performance across multiple datasets and LLMs, raising the AUC from 0.7 to over 0.9.

### Formula summary

1. **Joint probability maximization**: \[ L_{\text{CLM}} = -\frac{1}{M} \sum_{j = 1}^{M} \sum_{i = 1}^{|x^{(j)}|} \log p_\theta(t_i \mid x^{(j)}_{<i}) \] where \(M\) is the number of training records, and \(p_\theta(t_i \mid x^{(j)}_{<i})\) is the probability of predicting the next token \(t_i\) given the prefix \(x^{(j)}_{<i}\). Minimizing this loss maximizes the joint probability of the training texts.
2. **Definition of probabilistic variation**: \[ \tilde{p}_\theta(x) := \mathbb{E}_z\left[z^{\top} H_p(x) z\right] \] where \(H_p(x)\) is the Hessian matrix of the probability function \(p_\theta(x)\), and \(z^{\top} H_p(x) z\) is the second-order directional derivative in the direction \(z\).
3. **Symmetric approximation**: \[ z^{\top} H_p(x) z \approx \frac{p_\theta(x + hz) + p_\theta(x - hz) - 2p_\theta(x)}{h^2} \] Averaging over \(N\) sampled directions and dropping the constant scale factor simplifies this to: \[ \tilde{p}_\theta(x) \approx \frac{1}{2N} \sum_{n = 1}^{N} \left( p_\theta(\tilde{x}^{+}_{n}) + p_\theta(\tilde{x}^{-}_{n}) \right) - p_\theta(x) \] where \(\tilde{x}^{\pm}_{n} = x \pm z_n\) are symmetric text pairs generated by a paraphrasing model.

Through these improvements, SPV-MIA can detect member records more accurately without relying on a high-quality reference dataset, thereby revealing the privacy-leakage risks of fine-tuned LLMs.
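The simplified estimator of probabilistic variation given above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `probabilistic_variation`, `peaked`, and `sloped` are hypothetical names, and the toy scalar functions stand in for the target model's probability \(p_\theta\), while the numeric offsets stand in for the symmetric paraphrase pairs that the paper obtains from a paraphrasing model.

```python
from typing import Callable, Sequence, Tuple, TypeVar

X = TypeVar("X")


def probabilistic_variation(
    p: Callable[[X], float],
    x: X,
    pairs: Sequence[Tuple[X, X]],
) -> float:
    """Estimate (1 / 2N) * sum_n (p(x_n^+) + p(x_n^-)) - p(x).

    `p` is any callable returning a probability for an input (assumption:
    in a real attack it would wrap the target LLM), and `pairs` holds the
    N symmetric perturbations (x_n^+, x_n^-) of `x`.
    """
    if not pairs:
        raise ValueError("at least one symmetric pair is required")
    total = sum(p(x_plus) + p(x_minus) for x_plus, x_minus in pairs)
    return total / (2 * len(pairs)) - p(x)


# Toy stand-ins for p_theta (hypothetical, for illustration only).
def peaked(v: float) -> float:
    """A function with a local maximum at v = 0."""
    return 1.0 - v * v


def sloped(v: float) -> float:
    """A linear function with no curvature."""
    return 0.5 + 0.1 * v


if __name__ == "__main__":
    # Symmetric pairs around x = 0, playing the role of x +/- z_n.
    pairs = [(0.1, -0.1), (0.2, -0.2)]
    # At a local maximum the neighbors score lower, so the estimate is negative.
    print("peaked:", probabilistic_variation(peaked, 0.0, pairs))
    # With zero curvature the symmetric neighbors cancel out.
    print("sloped:", probabilistic_variation(sloped, 0.0, pairs))
```

The estimator only needs forward evaluations of the probability function at the original input and its perturbed neighbors, which is why the symmetric approximation is convenient: no Hessian has to be computed explicitly.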