Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at http://github.com/llm-attacks/llm-attacks.

Training NLI Models Through Universal Adversarial Attack

Towards Improving Adversarial Training of NLP Models

Adversarial Training for Large Neural Language Models

Generating Universal Language Adversarial Examples by Understanding and Enhancing the Transferability Across Neural Models

Universal and Transferable Adversarial Attacks on Aligned Language Models

The triggers that open the NLP model backdoors are hidden in the adversarial samples

A Universal Defense Strategy Against Adversarial Attacks Based on Attention-Guided

Learning Universal Adversarial Perturbation by Adversarial Example

Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations

Exploring the Vulnerability of Natural Language Processing Models via Universal Adversarial Texts

Rethinking Textual Adversarial Defense for Pre-trained Language Models

Teaching a Language Model to Distinguish Between Similar Details using a Small Adversarial Training Set

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning

Universal Rules for Fooling Deep Neural Networks based Text Classification

TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models

Enhancing Neural Models with Vulnerability Via Adversarial Attack

Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers

Are aligned neural networks adversarially aligned?

Generating Valid and Natural Adversarial Examples with Large Language Models

Joint Universal Adversarial Perturbations with Interpretations

Towards Variable-Length Textual Adversarial Attacks