Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open-source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at http://github.com/llm-attacks/llm-attacks.

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Open Sesame! Universal Black Box Jailbreaking of Large Language Models

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Weak-to-Strong Jailbreaking on Large Language Models

Playing Language Game with LLMs Leads to Jailbreaking

Universal and Transferable Adversarial Attacks on Aligned Language Models

Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Universal Jailbreak Backdoors from Poisoned Human Feedback

Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Distract Large Language Models for Automatic Jailbreak Attack

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Low-Resource Languages Jailbreak GPT-4