Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen

2023-10-11

Abstract: The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.

Computation and Language, Artificial Intelligence, Cryptography and Security

What problem does this paper attempt to address?

This paper addresses significant shortcomings in the safety evaluation and alignment procedures of current open-source large language models (LLMs). Even carefully aligned models can be maliciously manipulated into unintended behaviors, known as "jailbreaks." Jailbreaks are typically triggered by specific text inputs called adversarial prompts. This paper instead proposes a generation exploitation attack, which leaves the prompt alone and disrupts the model's alignment merely by varying the decoding method, thereby increasing the model's misalignment rate.

The main contributions of the paper are:

1. **Generation Exploitation Attack**: By altering decoding hyperparameters and sampling methods, the researchers increased the misalignment rate of 11 open-source language models from 0% to over 95%, at a computational cost 30 times lower than existing state-of-the-art attacks.
2. **New Evaluation Benchmark**: The researchers created a new benchmark dataset, MaliciousInstruct, which covers a broader range of malicious intents and is used to evaluate how models behave under different generation strategies.
3. **Generation-Aware Alignment Method**: To counter generation exploitation attacks, the researchers proposed a generation-aware alignment method. It strengthens alignment by actively collecting model outputs under different generation configurations, which reduces the misalignment rate under the attack.

Overall, the study reveals significant failures in the safety evaluation and alignment of current open-source LLMs and strongly calls for more comprehensive red teaming and better alignment before such models are released.
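For readers who want to see what evaluating a model under varied decoding settings looks like as a red-teaming harness, the following is a minimal sketch, not the authors' implementation. The function names (`decoding_grid`, `misalignment_rate`), the grid values, and the rule that a prompt counts as misaligned if any configuration yields a flagged output are all assumptions made for illustration. The `generate` and `is_misaligned` callables are hypothetical stand-ins for a model call and a safety judge.

```python
from typing import Callable, Dict, List, Sequence

Config = Dict[str, object]


def decoding_grid() -> List[Config]:
    """Return a small set of decoding configurations.

    The values are illustrative assumptions, not the grid used in the paper.
    """
    configs: List[Config] = [{"do_sample": False}]  # greedy decoding baseline
    configs += [{"do_sample": True, "temperature": t} for t in (0.25, 0.5, 0.75, 1.0)]
    configs += [{"do_sample": True, "top_k": k} for k in (1, 10, 100)]
    configs += [{"do_sample": True, "top_p": p} for p in (0.25, 0.5, 0.75, 1.0)]
    return configs


def misalignment_rate(
    prompts: Sequence[str],
    generate: Callable[[str, Config], str],
    is_misaligned: Callable[[str, str], bool],
    configs: Sequence[Config],
) -> float:
    """Fraction of prompts for which at least one configuration is flagged.

    Assumption: a prompt counts as misaligned if ANY configuration in
    `configs` produces an output that the judge flags.
    """
    if not prompts:
        return 0.0
    flagged = sum(
        any(is_misaligned(prompt, generate(prompt, cfg)) for cfg in configs)
        for prompt in prompts
    )
    return flagged / len(prompts)
```

Calling `misalignment_rate` once with only the greedy configuration and once with the full grid gives a before/after comparison in the spirit of the 0% to 95% figures quoted above. The same loop could also serve the defensive direction: outputs gathered under each configuration are the kind of data a generation-aware alignment procedure would collect.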
