Adversarial Robustness of In-Context Learning in Transformers for Linear Regression

Usman Anwar, Johannes Von Oswald, Louis Kirsch, David Krueger, Spencer Frei
2024-11-08
Abstract: Transformers have demonstrated remarkable in-context learning capabilities across various domains, including statistical learning tasks. While previous work has shown that transformers can implement common learning algorithms, the adversarial robustness of these learned algorithms remains unexplored. This work investigates the vulnerability of in-context learning in transformers to *hijacking attacks*, focusing on the setting of linear regression tasks. Hijacking attacks are prompt-manipulation attacks in which the adversary's goal is to manipulate the prompt to force the transformer to generate a specific output. We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust and can be manipulated to output arbitrary predictions by perturbing a single example in the in-context training set. While our experiments show these attacks succeed on linear transformers, we find they do not transfer to more complex transformers with GPT-2 architectures. Nonetheless, we show that these transformers can be hijacked using gradient-based adversarial attacks. We then demonstrate that adversarial training enhances transformers' robustness against hijacking attacks, even when just applied during finetuning. Additionally, we find that in some settings, adversarial training against a weaker attack model can lead to robustness to a stronger attack model. Lastly, we investigate the transferability of hijacking attacks across transformers of varying scales and initialization seeds, as well as between transformers and ordinary least squares (OLS). We find that while attacks transfer effectively between small-scale transformers, they show poor transferability in other scenarios (small-to-large scale, large-to-large scale, and between transformers and OLS).
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
This paper investigates the following questions:

1. **Adversarial robustness of in-context learning in transformers**: The paper studies whether in-context learning in transformers is vulnerable to "hijacking attacks" on linear regression tasks. A hijacking attack manipulates the prompt so that the transformer is forced to generate a specific output chosen by the adversary.
2. **Vulnerability of single-layer linear transformers**: The authors prove that single-layer linear transformers, which are known to implement gradient descent in-context, are non-robust. Perturbing a single example in the in-context training set into an adversarial pair \((x_{adv}, y_{adv})\) is enough to force the transformer to output an arbitrary prediction \(y_{bad}\).
3. **Robustness of more complex transformer architectures (such as GPT-2)**: The paper then studies transformers with GPT-2 architectures. Experiments show that the attacks that succeed against single-layer linear transformers do not transfer to GPT-2 architectures, but these models can still be hijacked with gradient-based adversarial attacks.
4. **Improving robustness through adversarial training**: The authors examine whether adversarial training makes transformers more robust to hijacking attacks. It does, even when it is applied only during fine-tuning rather than pre-training. In some settings, adversarial training against a weaker attack model also yields robustness to a stronger attack model.
5. **Transferability of attacks across scales, initialization seeds, and model types**: The study also examines how well hijacking attacks transfer between transformers of different scales and initialization seeds, and between transformers and ordinary least squares (OLS). Attacks transfer effectively between small-scale transformers but poorly in the other scenarios: small-to-large scale, large-to-large scale, and between transformers and OLS.

### Formula Presentation

The key formulas involved in the paper are the following:

- **Output of the linear self-attention (LSA) transformer**:

\[ f_{LSA}(E; \theta) = E + W_{PV} E \cdot \frac{E^{\top} W_{KQ} E}{N} \]

where \(E\) is the embedding matrix of the prompt, \(N\) is the number of in-context examples, and \(W_{PV}\) and \(W_{KQ}\) are the merged projection-value and key-query matrices, respectively.

- **Training objective**:

\[ \widehat{L}(\theta) = \frac{1}{2B} \sum_{\tau = 1}^{B} \left( \left[ f(E_{\tau}; \theta) \right]_{d + 1, N + 1} - y_{\tau, query} \right)^{2} \]

where \(B\) is the number of training prompts and the entry \([\cdot]_{d + 1, N + 1}\) of the output is read off as the prediction for the query.

- **Target label for the hijacking attack**:

\[ y_{bad} = (1 - \alpha) w^{\top} x_{query} + \alpha w_{\perp}^{\top} x_{query} \]

where \(w\) is the original weight vector, \(w_{\perp}\) is a vector orthogonal to \(w\), and \(\alpha \in [0, 1]\) controls how far the target label is from the clean one: \(\alpha = 0\) gives the clean label \(w^{\top} x_{query}\), and \(\alpha = 1\) gives the label generated by \(w_{\perp}\).

Through these studies, the paper reveals potential security problems of in-context learning in transformers and shows that adversarial training can mitigate them.
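The formulas above can be tied together in a small numerical sketch. The code below is an illustration under stated assumptions, not the paper's implementation or its exact attack construction. It uses hand-picked (not trained) weights \(W_{KQ}\) and \(W_{PV}\) for which the LSA output entry \([\cdot]_{d+1, N+1}\) equals the prediction of one gradient-descent step from zero on the least-squares loss. It then changes the label of a single in-context example so that the prediction equals \(y_{bad}\). The function names, the noiseless Gaussian data, the choice to perturb only the label, and the scaling of \(w_{\perp}\) to the norm of \(w\) are all assumptions made for illustration.

```python
import numpy as np

def one_step_gd_predict(X, y, x_query):
    """Prediction after one GD step (step size 1, start at w = 0) on
    (1/2N) * sum_i (w^T x_i - y_i)^2, i.e. x_query^T (1/N) sum_i y_i x_i."""
    return x_query @ (X.T @ y) / len(y)

def lsa_predict(X, y, x_query):
    """Entry [d+1, N+1] of E + W_PV E (E^T W_KQ E) / N for hand-picked weights
    (an assumption for illustration, not trained weights)."""
    n, d = X.shape
    E = np.zeros((d + 1, n + 1))
    E[:d, :n] = X.T          # covariates of the in-context examples
    E[:d, n] = x_query       # query covariate in the last column
    E[d, :n] = y             # labels in the last row; query label slot is 0
    W_KQ = np.zeros((d + 1, d + 1))
    W_KQ[:d, :d] = np.eye(d)
    W_PV = np.zeros((d + 1, d + 1))
    W_PV[d, d] = 1.0
    out = E + W_PV @ E @ (E.T @ W_KQ @ E) / n
    return out[d, n]

def target_label(w, x_query, alpha, rng):
    """y_bad = (1 - alpha) w^T x_query + alpha w_perp^T x_query."""
    v = rng.standard_normal(w.shape)
    w_perp = v - (v @ w) / (w @ w) * w                      # remove component along w
    w_perp *= np.linalg.norm(w) / np.linalg.norm(w_perp)    # assumption: same norm as w
    return (1 - alpha) * (w @ x_query) + alpha * (w_perp @ x_query), w_perp

def hijack_label(X, y, x_query, y_bad, idx=0):
    """Change only y[idx] so that the one-step GD prediction equals y_bad."""
    inner = X @ x_query                      # x_i^T x_query for every example
    if abs(inner[idx]) < 1e-12:
        raise ValueError("chosen example is orthogonal to the query")
    rest = inner @ y - inner[idx] * y[idx]   # contribution of the untouched examples
    y_adv = y.copy()
    y_adv[idx] = (len(y) * y_bad - rest) / inner[idx]
    return y_adv

rng = np.random.default_rng(0)
d, N = 5, 20
w = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = X @ w                                    # noiseless labels (assumption)
x_query = rng.standard_normal(d)

y_bad, w_perp = target_label(w, x_query, alpha=1.0, rng=rng)
y_adv = hijack_label(X, y, x_query, y_bad)

clean_pred = one_step_gd_predict(X, y, x_query)
hijacked_pred = one_step_gd_predict(X, y_adv, x_query)
print(f"clean: {clean_pred:.4f}  hijacked: {hijacked_pred:.4f}  target: {y_bad:.4f}")
```

Because the prediction is linear in each label, a single label can always be solved for in closed form as long as the perturbed example is not orthogonal to the query. This linearity is the intuition behind the non-robustness of this simplified predictor; the GPT-2 style models discussed above have no such closed form, which is consistent with the paper resorting to gradient-based attacks for them.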