Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at http://github.com/llm-attacks/llm-attacks.

Adversarial Attacks on Large Language Model-Based System and Mitigating Strategies: A Case Study on ChatGPT

Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

On Evaluating Adversarial Robustness of Large Vision-Language Models

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Transfer Attacks and Defenses for Large Language Models on Coding Tasks

Universal and Transferable Adversarial Attacks on Aligned Language Models

Exploring the Adversarial Capabilities of Large Language Models

A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

How Robust Is a Large Pre-trained Language Model for Code Generation? A Case on Attacking GPT2

From Text to MITRE Techniques: Exploring the Malicious Use of Large Language Models for Generating Cyber Attack Payloads

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Trojaning Language Models for Fun and Profit

Adversarial Demonstration Attacks on Large Language Models

Target-driven Attack for Large Language Models

PAL: Proxy-Guided Black-Box Attack on Large Language Models

Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant

BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT