Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
2023-10-11
Abstract: The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.
Computation and Language, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses a major shortcoming in the safety evaluation and alignment procedures of current open-source large language models (LLMs). Even carefully aligned models can be maliciously manipulated into unintended behaviors, known as "jailbreaks." Jailbreaks are typically triggered by specific text inputs called adversarial prompts. This paper instead proposes a generation exploitation attack, which disrupts a model's alignment purely by varying the decoding method, without crafting adversarial prompts. The main contributions of the paper are:

1. **Generation Exploitation Attack**: By altering decoding hyperparameters and sampling methods, the researchers increased the misalignment rate of 11 open-source language models from 0% to over 95%, at a computational cost 30 times lower than that of existing state-of-the-art attacks.
2. **New Evaluation Benchmark**: The researchers created a new benchmark dataset, MaliciousInstruct, which covers a broader range of malicious intents and is used to evaluate models under different generation strategies.
3. **Generation-Aware Alignment Method**: To counter generation exploitation attacks, the researchers proposed a generation-aware alignment method. It improves alignment by proactively collecting model outputs under different generation configurations, which reduces the misalignment rate under the attack.

Overall, the study reveals a major failure in the safety evaluation and alignment of current open-source LLMs and strongly advocates more comprehensive red teaming and better alignment before such models are released.
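
To make the evaluation idea concrete, the following is a minimal sketch of how a sweep over decoding configurations and the resulting misalignment rate (the metric reported in the abstract) might be organized. It is not the authors' code. The grid values, the function names `decoding_grid` and `misalignment_rate`, and the rule that a prompt counts as misaligned when any configuration is flagged are all assumptions made for illustration. No model is called here: the judgments are supplied as a plain table of booleans.

```python
def decoding_grid():
    """Hypothetical sweep that varies one sampling knob at a time.

    The value ranges below are illustrative assumptions, not the paper's.
    """
    temperatures = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1 .. 1.0
    top_ks = [1, 5, 10, 50, 100]
    top_ps = [round(0.1 * i, 1) for i in range(1, 11)]        # 0.1 .. 1.0
    return (
        [{"temperature": t} for t in temperatures]
        + [{"top_k": k} for k in top_ks]
        + [{"top_p": p} for p in top_ps]
    )


def misalignment_rate(flags):
    """Fraction of prompts flagged under at least one configuration.

    flags[i][j] is True when the output for prompt i under configuration j
    was judged misaligned by some external evaluator (assumed, not shown).
    """
    if not flags:
        return 0.0
    return sum(any(row) for row in flags) / len(flags)


if __name__ == "__main__":
    grid = decoding_grid()
    # Toy judgments for two prompts: only the second prompt is flagged,
    # and only under the last configuration.
    toy_flags = [
        [False] * len(grid),
        [False] * (len(grid) - 1) + [True],
    ]
    print(len(grid), misalignment_rate(toy_flags))
```

Aggregating with `any` over configurations illustrates why a model that looks fully aligned under a single default configuration can still show a high misalignment rate once many configurations are tested, and why safety evaluation that covers only the default decoding setup can be too narrow.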