Data Poisoning for In-context Learning

Pengfei He, Han Xu, Yue Xing, Hui Liu, Makoto Yamada, Jiliang Tang
2024-03-28
Abstract: In the domain of large language models (LLMs), in-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks, relying on examples rather than retraining or fine-tuning. This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks, an area not yet fully explored. We wonder whether ICL is vulnerable, with adversaries capable of manipulating example data to degrade model performance. To address this, we introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL. Our approach uniquely employs discrete text perturbations to strategically influence the hidden states of LLMs during the ICL process. We outline three representative strategies to implement attacks under our framework, each rigorously evaluated across a variety of models and tasks. Our comprehensive tests, including trials on the sophisticated GPT-4 model, demonstrate that ICL's performance is significantly compromised under our framework. These revelations indicate an urgent need for enhanced defense mechanisms to safeguard the integrity and reliability of LLMs in applications relying on in-context learning.
Cryptography and Security
What problem does this paper attempt to address?
The problem this paper addresses is **the vulnerability of in-context learning (ICL) in large language models (LLMs) to data-poisoning attacks**. Specifically, the authors ask whether maliciously tampered example data can degrade model performance. They propose and validate ICLPoison, an ICL-specific data-poisoning attack framework, to expose this vulnerability, and they emphasize the need for stronger defense mechanisms to protect the integrity and reliability of applications that rely on ICL.

### Background and Problem Description

1. **Advantages and Limitations of In-context Learning (ICL)**
   - ICL lets large language models (LLMs) learn a task from a small number of examples, without retraining or fine-tuning the model parameters.
   - Although ICL is flexible and efficient, its performance is very sensitive to the selection and order of the example data.
2. **The Threat of Data-poisoning Attacks**
   - Malicious actors may reduce model performance by tampering with the example data used for ICL.
   - For example, an attacker can strategically change the content of the examples, causing the model to make inaccurate or biased predictions.
3. **Insufficiencies in Existing Research**
   - Research on the vulnerability of ICL to data-poisoning attacks remains limited.
   - Traditional data-poisoning methods mainly target explicit training processes and loss functions. ICL has no explicit training objective, so these methods are not directly applicable to it.

### Main Contributions of the Paper

1. **Introduction of the ICLPoison Framework**
   - ICLPoison is a data-poisoning attack framework designed specifically for ICL. It uses discrete text perturbations to strategically influence the hidden states of LLMs.
   - The framework proposes three representative attack strategies: synonym substitution, character substitution, and adversarial suffix.
2. **Experimental Verification**
   - The authors verified the effectiveness of ICLPoison through extensive experiments, including tests on multiple LLMs (such as GPT-4).
   - The experimental results show that ICL performance drops significantly under an ICLPoison attack, indicating that ICL is indeed vulnerable to data-poisoning attacks.

### Formula Summary

In the paper, the authors measure the change in hidden states with the following formulas. The per-layer distance is the L2 distance between the normalized hidden states of the original and the perturbed example:

\[ l_d(h_l(x_p^{(i,t)}, f), h_l(\delta_i(x_p^{(i,t)}), f))=\left\|\frac{h_l(x_p^{(i,t)}, f)}{\|h_l(x_p^{(i,t)}, f)\|_2}-\frac{h_l(\delta_i(x_p^{(i,t)}), f)}{\|h_l(\delta_i(x_p^{(i,t)}), f)\|_2}\right\|_2 \]

The overall distortion is the minimum of this distance over all layers:

\[ L_d(H(x_p^{(i,t)}, f), H(\delta_i(x_p^{(i,t)}), f)) = \min_{l\in[L]}l_d(h_l(x_p^{(i,t)}, f), h_l(\delta_i(x_p^{(i,t)}), f)) \]

The goal of the attack is to maximize this minimum difference:

\[ \max_{\delta_i\in\Delta}L_d(H(x_p^{(i,t)}, f), H(\delta_i(x_p^{(i,t)}), f)) \]

These formulas quantify how far a perturbation shifts the hidden states, and the last one serves as the attack objective.

### Conclusion

By introducing the ICLPoison framework, this paper systematically studied the vulnerability of ICL to data-poisoning attacks for the first time, revealed the sensitivity of ICL to data tampering at different levels, and emphasized the importance of strengthening defense mechanisms.
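
To make the Formula Summary concrete, here is a minimal Python sketch of the three formulas. It is not the authors' implementation. It assumes the per-layer hidden states have already been extracted as NumPy vectors, and the function names (`layer_distance`, `min_layer_distance`, `pick_perturbation`) and the `hidden_states` lookup are hypothetical names chosen for illustration.

```python
import numpy as np

def layer_distance(h_clean, h_pert):
    """l_d: L2 distance between the L2-normalized hidden states of one layer."""
    h_clean = np.asarray(h_clean, dtype=float)
    h_pert = np.asarray(h_pert, dtype=float)
    u = h_clean / np.linalg.norm(h_clean)
    v = h_pert / np.linalg.norm(h_pert)
    return float(np.linalg.norm(u - v))

def min_layer_distance(H_clean, H_pert):
    """L_d: the smallest per-layer distance over all layers."""
    return min(layer_distance(a, b) for a, b in zip(H_clean, H_pert))

def pick_perturbation(x, candidates, hidden_states):
    """Return the candidate perturbation of x that maximizes L_d.

    hidden_states is an assumed callable mapping a text to a list of
    per-layer vectors; in practice it would wrap a forward pass of an LLM.
    """
    H_clean = hidden_states(x)
    return max(candidates,
               key=lambda c: min_layer_distance(H_clean, hidden_states(c)))

# Usage with made-up two-layer hidden states (illustrative values only).
states = {
    "clean": [np.array([1.0, 0.0]), np.array([1.0, 0.0])],
    "cand_a": [np.array([0.0, 1.0]), np.array([1.0, 1.0])],
    "cand_b": [np.array([0.0, 1.0]), np.array([1.0, 0.0])],
}
best = pick_perturbation("clean", ["cand_a", "cand_b"], states.get)
print(best)
```

In this toy data, `cand_b` leaves the second layer unchanged, so its minimum distance is zero and the max-min objective prefers `cand_a`, which shifts every layer. Taking the minimum over layers rewards perturbations that move all layers, not just one.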