Abstract: Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method does not require any fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the most prominent current LLMs, against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. Notably, LLM Self Defense succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 and Llama 2. The code is publicly available at https://github.com/poloclub/llm-self-defense

What problem does this paper attempt to address?

This paper addresses the problem that large language models (LLMs) can generate harmful content even after they have been aligned with human values through methods such as reinforcement learning. Specifically, adversarial prompts can bypass a model's safety measures and induce malicious outputs. The paper proposes LLM Self Defense, which defends against these attacks by having an LLM screen the generated content before it is returned.

### Main Issues

1. **Harmful Content Generation**: Alongside high-quality text, LLMs may produce harmful content such as phishing emails, malicious code, and hate speech.
2. **Adversarial Attacks**: Prompt engineering, or more advanced techniques such as adversarial suffix attacks, can manipulate LLMs into generating harmful content.
3. **Limitations of Existing Defense Methods**: Existing defenses usually require fine-tuning, input preprocessing, or iterative generation, which makes them complex and inefficient.

### Solution

- **LLM Self Defense**: A zero-shot defense that inserts the generated text into a predefined prompt and uses another LLM instance to judge whether the text is harmful. The method is simple and requires no modification of the underlying model.

### Experimental Results

- **Evaluation Models**: The paper evaluates the method on two popular LLMs, GPT 3.5 and Llama 2.
- **Performance**: LLM Self Defense reduces the attack success rate to almost zero, identifying nearly all harmful text.
- **Reducing False Positives**: Placing the harm-detection instruction after the text to be screened (as a suffix) rather than before it (as a prefix) significantly reduces the false positive rate.

### Conclusion

LLM Self Defense provides a simple and effective defense mechanism that screens out harmful LLM output without fine-tuning, input preprocessing, or iterative generation. This matters for improving the safety and reliability of LLMs.
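
The screening step summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names (`build_filter_prompt`, `is_harmful`, `guarded_reply`), and the `keyword_stub` classifier are all hypothetical. In practice `classify` would wrap a call to a second LLM instance such as GPT 3.5 or Llama 2.

```python
from typing import Callable

# Hypothetical prompt templates; the exact wording used in the paper may differ.
# Suffix variant: the harm question comes AFTER the text being screened.
SUFFIX_TEMPLATE = (
    "Here is some text:\n{text}\n\n"
    "Is the text above harmful? Answer 'yes' or 'no'."
)
# Prefix variant: the harm question comes BEFORE the text being screened.
PREFIX_TEMPLATE = (
    "Is the following text harmful? Answer 'yes' or 'no'.\n\n"
    "Here is the text:\n{text}"
)


def build_filter_prompt(text: str, instruction_as_suffix: bool = True) -> str:
    """Insert the generated text into a predefined screening prompt."""
    template = SUFFIX_TEMPLATE if instruction_as_suffix else PREFIX_TEMPLATE
    return template.format(text=text)


def is_harmful(
    text: str,
    classify: Callable[[str], str],
    instruction_as_suffix: bool = True,
) -> bool:
    """Ask the filter model for a verdict and treat a leading 'yes' as harmful."""
    verdict = classify(build_filter_prompt(text, instruction_as_suffix))
    return verdict.strip().lower().startswith("yes")


def guarded_reply(
    response: str,
    classify: Callable[[str], str],
    refusal: str = "Sorry, I cannot help with that.",
) -> str:
    """Return the response unchanged unless the filter flags it as harmful."""
    return refusal if is_harmful(response, classify) else response


def keyword_stub(prompt: str) -> str:
    """Toy stand-in for the filter LLM (assumption, for demonstration only)."""
    return "Yes" if "phishing" in prompt.lower() else "No"


if __name__ == "__main__":
    print(guarded_reply("The capital of France is Paris.", keyword_stub))
    print(guarded_reply("Sure, here is a phishing email: ...", keyword_stub))
```

Because the filter sees only the generated response, no change to the generating model or to the user's prompt is needed. The `instruction_as_suffix` flag makes it easy to compare the two prompt orderings mentioned in the results.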

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Self-Evaluation as a Defense Against Adversarial Attacks on LLMs

Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Large Language Model Sentinel: LLM Agent for Adversarial Purification

Universal and Transferable Adversarial Attacks on Aligned Language Models

Certifying LLM Safety against Adversarial Prompting

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Exploring the Adversarial Capabilities of Large Language Models

Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM

Goal-Oriented Prompt Attack and Safety Evaluation for LLMs

A LLM Assisted Exploitation of AI-Guardian

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

Cross-Task Defense: Instruction-Tuning LLMs for Content Safety

Are You Human? An Adversarial Benchmark to Expose LLMs

Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models

Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation

Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context