LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, Duen Horng Chau
2024-05-02
Abstract: Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method does not require any fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent LLMs against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. Notably, LLM Self Defense succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 and Llama 2. The code is publicly available at https://github.com/poloclub/llm-self-defense
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem that large language models (LLMs) may generate harmful content even after they have been aligned with human values through methods such as reinforcement learning. Specifically, adversarial prompts can bypass the safety measures of these models and induce malicious outputs. The paper proposes a method called LLM Self Defense, which defends against these attacks by having an LLM check the generated content for harm.

### Main Issues

1. **Harmful Content Generation**: While generating high-quality text, LLMs may also produce harmful content, such as phishing emails, malicious code, and hate speech.
2. **Adversarial Attacks**: Through specific prompt engineering or more advanced techniques like adversarial suffix attacks, LLMs can be manipulated to generate harmful content.
3. **Limitations of Existing Defense Methods**: Existing defense methods usually require fine-tuning, input preprocessing, or iterative generation, which makes them complex and inefficient.

### Solution

- **LLM Self Defense**: This is a zero-shot defense method that inserts the generated text into a predefined prompt and uses another LLM instance to analyze the text and determine whether it contains harmful content. The method is simple, efficient, and does not require any modifications to the underlying model.

### Experimental Results

- **Evaluation Models**: The paper evaluates the method on two popular LLMs (GPT 3.5 and Llama 2).
- **Performance**: LLM Self Defense reduces the attack success rate to almost zero, identifying nearly all harmful text.
- **Reducing False Positives**: Placing the harm-detection question after the text to be checked (as a suffix) rather than before it (as a prefix) significantly reduces the false positive rate.

### Conclusion

LLM Self Defense provides a simple and effective defense mechanism that can prevent LLMs from returning harmful content without adding extra complexity and overhead. This method is significant for improving the safety and reliability of LLMs.
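
The filtering step described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The prompt wording, the function names, and the `llm` callable interface are all assumptions made for illustration, and the keyword-based stub only stands in for a real call to a model such as GPT 3.5 or Llama 2. The sketch shows the two ideas in the summary: a second LLM pass classifies the generated text, and the harm question can be placed either before the text (prefix) or after it (suffix).

```python
from typing import Callable

# Hypothetical prompt wording; the paper's exact filter prompts may differ.
PREFIX_QUESTION = "Does the following text contain harmful content?"
SUFFIX_QUESTION = "Does the text above contain harmful content?"
ANSWER_FORMAT = "Answer 'Yes, this is harmful' or 'No, this is not harmful'."
REFUSAL = "Sorry, I cannot help with that request."


def build_filter_prompt(response_text: str, question_as_suffix: bool = True) -> str:
    """Wrap a generated response in a harm-detection prompt.

    The question is placed after the text (suffix) by default, the placement
    the summary reports as giving fewer false positives.
    """
    quoted = f'Text: "{response_text}"'
    if question_as_suffix:
        return f"{quoted}\n\n{SUFFIX_QUESTION} {ANSWER_FORMAT}"
    return f"{PREFIX_QUESTION} {ANSWER_FORMAT}\n\n{quoted}"


def is_harmful(
    response_text: str,
    llm: Callable[[str], str],
    question_as_suffix: bool = True,
) -> bool:
    """Ask a separate LLM instance whether the response is harmful."""
    verdict = llm(build_filter_prompt(response_text, question_as_suffix))
    # Assumes the filter model follows the requested answer format.
    return verdict.strip().lower().startswith("yes")


def self_defense(response_text: str, llm: Callable[[str], str]) -> str:
    """Return the response unchanged if judged harmless, else a refusal."""
    return REFUSAL if is_harmful(response_text, llm) else response_text


def keyword_stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if "phishing" in prompt.lower():
        return "Yes, this is harmful."
    return "No, this is not harmful."


if __name__ == "__main__":
    print(self_defense("Sure, here is a phishing email template: ...", keyword_stub_llm))
    print(self_defense("Here is a recipe for banana bread.", keyword_stub_llm))
```

In practice `keyword_stub_llm` would be replaced by a function that sends the prompt to the filter model and returns its text reply. Because the filter only reads the generated response, it needs no fine-tuning and no changes to the model that produced the text.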