Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at http://github.com/llm-attacks/llm-attacks.

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models

Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

SneakyPrompt: Jailbreaking Text-to-image Generative Models

Natural Language Induced Adversarial Images

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Generating Natural Language Adversarial Examples on a Large Scale with Generative Models

SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

Backdooring Bias into Text-to-Image Models

BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models

Adversarial Attacks on Large Language Model-Based System and Mitigating Strategies: A Case Study on ChatGPT

Universal and Transferable Adversarial Attacks on Aligned Language Models

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models