Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at http://github.com/llm-attacks/llm-attacks.
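The abstract describes the search only as "a combination of greedy and gradient-based search techniques". As a rough illustration of that general idea, the sketch below runs a greedy, gradient-guided token search on a purely synthetic objective. It involves no language model and is not the authors' implementation: `toy_loss`, `token_gradients`, `greedy_gradient_search`, `make_toy_problem`, and all parameters (`top_k`, `batch`, `steps`) are hypothetical names chosen here, and the random score tables merely stand in for a loss such as the negative log-probability of an affirmative response.

```python
import random


def toy_loss(suffix, unary, pair):
    """Synthetic stand-in for a model loss (assumption: lower is better)."""
    total = sum(unary[i][tok] for i, tok in enumerate(suffix))
    total += sum(pair[a][b] for a, b in zip(suffix, suffix[1:]))
    return total


def token_gradients(suffix, unary, pair):
    """Gradient of toy_loss w.r.t. the one-hot encoding of each position."""
    n, vocab = len(suffix), len(pair)
    grads = []
    for i in range(n):
        row = []
        for v in range(vocab):
            g = unary[i][v]
            if i > 0:
                g += pair[suffix[i - 1]][v]
            if i + 1 < n:
                g += pair[v][suffix[i + 1]]
            row.append(g)
        grads.append(row)
    return grads


def greedy_gradient_search(unary, pair, steps=50, top_k=4, batch=16, seed=0):
    rng = random.Random(seed)
    n, vocab = len(unary), len(pair)
    suffix = [rng.randrange(vocab) for _ in range(n)]
    best = toy_loss(suffix, unary, pair)
    history = [best]
    for _ in range(steps):
        grads = token_gradients(suffix, unary, pair)
        # Gradient step: per position, shortlist the top_k tokens whose
        # gradient entry is lowest (most promising under a linear estimate).
        shortlist = [
            sorted(range(vocab), key=lambda v: grads[i][v])[:top_k]
            for i in range(n)
        ]
        # Greedy step: score a random batch of single-token swaps exactly
        # and keep the best one, but only if it improves on the current loss.
        best_cand, best_cand_loss = suffix, best
        for _ in range(batch):
            i = rng.randrange(n)
            cand = list(suffix)
            cand[i] = rng.choice(shortlist[i])
            loss = toy_loss(cand, unary, pair)
            if loss < best_cand_loss:
                best_cand, best_cand_loss = cand, loss
        suffix, best = best_cand, best_cand_loss
        history.append(best)
    return suffix, history


def make_toy_problem(n=6, vocab=12, seed=1):
    """Random score tables; hypothetical data for illustration only."""
    rng = random.Random(seed)
    unary = [[rng.uniform(0, 1) for _ in range(vocab)] for _ in range(n)]
    pair = [[rng.uniform(0, 1) for _ in range(vocab)] for _ in range(vocab)]
    return unary, pair


unary, pair = make_toy_problem()
suffix, history = greedy_gradient_search(unary, pair)
print("loss:", history[0], "->", history[-1])
```

In this toy objective the gradient predicts the effect of a single-token swap exactly, so the exact re-scoring step is redundant. With a real model the gradient would only be a local approximation, which is presumably why a search of this kind would pair gradient-based shortlisting with exact evaluation of the candidates.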

ValCAT: Variable-Length Contextualized Adversarial Transformations Using Encoder-Decoder Language Model

Towards Variable-Length Textual Adversarial Attacks

CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Textual Adversarial Attack As Combinatorial Optimization

Vision-fused Attack: Advancing Aggressive and Stealthy Adversarial Text against Neural Machine Translation

SCAT: Robust Self-supervised Contrastive Learning via Adversarial Training for Text Classification

Character-level White-Box Adversarial Attacks Against Transformers Via Attachable Subwords Substitution

Mutual-modality Adversarial Attack with Semantic Perturbation

Exploring the Vulnerability of Natural Language Processing Models via Universal Adversarial Texts

Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations

SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

AdvExpander: Generating Natural Language Adversarial Examples by Expanding Text

TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Towards Improving Adversarial Training of NLP Models

TransAudio: Towards the Transferable Adversarial Audio Attack via Learning Contextualized Perturbations

Universal and Transferable Adversarial Attacks on Aligned Language Models