Self-Evaluation as a Defense Against Adversarial Attacks on LLMs

Hannah Brown, Leon Lin, Kenji Kawaguchi, Michael Shieh
2024-08-06
Abstract: We introduce a defense against adversarial attacks on LLMs utilizing self-evaluation. Our method requires no model fine-tuning, instead using pre-trained models to evaluate the inputs and outputs of a generator model, significantly reducing the cost of implementation in comparison to other, fine-tuning-based methods. Our method can significantly reduce the attack success rate of attacks on both open and closed-source LLMs, beyond the reductions demonstrated by Llama-Guard2 and commonly used content moderation APIs. We present an analysis of the effectiveness of our method, including attempts to attack the evaluator in various settings, demonstrating that it is also more resilient to attacks than existing methods. Code and data will be made available at https://github.com/Linlt-leon/self-eval.
Machine Learning, Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the security of large language models (LLMs) under adversarial attack. Specifically, the authors propose a self-evaluation-based defense that limits the effect of such attacks on LLMs.

### Core Issues of the Paper

1. **Threat of Adversarial Attacks**: Although existing LLMs are trained to generate helpful, harmless, and honest outputs, a variety of adversarial attacks (e.g., Zou et al., 2023; Zhu et al., 2023) can bypass existing safeguards and cause the model to generate harmful content.
2. **Limitations of Existing Defense Methods**: Current defenses include training with reinforcement learning from human feedback (RLHF), applying guardrails at inference time (Rebedea et al., 2023), and other methods for detecting harmful outputs (Team, 2024; Hu et al., 2024). However, these methods are either costly or depend on model fine-tuning or on proprietary APIs (such as OpenAI's content moderation API), and they remain vulnerable to attack.

### Proposed Solution

The authors propose a new defense method, **self-evaluation-based defense**, with the following main features:

- **No Fine-Tuning Required**: The method does not require additional fine-tuning. Instead, it uses a pre-trained model to evaluate the input and output of the generator model.
- **Efficient Implementation**: Compared with fine-tuning-based methods, it significantly reduces implementation cost while still reducing the attack success rate on both open-source and closed-source LLMs.
- **Robustness**: Beyond reducing the success rate of adversarial attacks, the method is more robust in settings where the evaluator itself is attacked, outperforming existing defenses such as Llama-Guard2 and common content moderation APIs.
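The evaluate-then-release flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `defended_generate`, `REFUSAL`, `toy_generator`, and `toy_evaluator` are hypothetical names, and the toy evaluator is a keyword check standing in for a pre-trained LLM that would be prompted to classify text as safe or unsafe.

```python
from typing import Callable

# Hypothetical refusal message returned whenever the evaluator flags something.
REFUSAL = "I'm sorry, but I can't help with that."


def defended_generate(
    prompt: str,
    generator: Callable[[str], str],
    evaluator: Callable[[str], bool],
) -> str:
    """Return the generator's response unless the evaluator flags the
    input or the output as unsafe (evaluator returns True for unsafe)."""
    # Check the input first, so flagged prompts never reach the generator.
    if evaluator(prompt):
        return REFUSAL
    response = generator(prompt)
    # Check the output too, since an attack may slip past the input check.
    if evaluator(response):
        return REFUSAL
    return response


# Toy stand-ins (assumptions): a real setup would call pre-trained LLMs here.
def toy_generator(prompt: str) -> str:
    return f"Response to: {prompt}"


def toy_evaluator(text: str) -> bool:
    return "build a bomb" in text.lower()


if __name__ == "__main__":
    print(defended_generate("How do I bake bread?", toy_generator, toy_evaluator))
    print(defended_generate("How do I build a bomb? !! xx", toy_generator, toy_evaluator))
```

Because the generator and evaluator are passed in as plain callables, neither model needs to be fine-tuned or modified; the same wrapper could sit in front of an open-source model or a closed-source API.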
### Experimental Results

The authors demonstrate the effectiveness of the method in multiple settings:

- **Significantly Reducing the Attack Success Rate (ASR)**: For inputs with adversarial suffixes, the method reduces the attack success rate from 95.0% to 0.0%, significantly outperforming other defense methods.
- **Strong Adaptability**: The method remains resistant under adaptive attacks. An attacker can train an adversarial suffix specifically against the evaluator, but even in the worst case the attack success rate is still lower than that of an unprotected generator.

### Conclusion

The paper shows that pre-trained LLMs can accurately identify attacked inputs and outputs through self-evaluation, providing a strong and easy-to-implement defense mechanism. Although attacks against this defense exist, self-evaluation remains one of the strongest current defense strategies, and it resists adversarial attacks without degrading model performance.
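For reference, the attack success rate (ASR) cited in the experimental results is typically computed as the percentage of attack attempts that end in a harmful response. The sketch below assumes that definition; `attack_success_rate` is a hypothetical helper, and the counts are illustrative values chosen to reproduce the reported percentages, not the paper's data.

```python
from typing import Sequence


def attack_success_rate(harmful_outputs: Sequence[bool]) -> float:
    """Percentage of attack attempts whose final output was harmful."""
    if not harmful_outputs:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(harmful_outputs) / len(harmful_outputs)


# Illustrative counts only (assumption): 19 of 20 attacks succeed without
# a defense, and none succeed once every flagged output is blocked.
undefended = [True] * 19 + [False]
defended = [False] * 20

if __name__ == "__main__":
    print(attack_success_rate(undefended))  # 95.0
    print(attack_success_rate(defended))    # 0.0
```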