Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection

Yuqi Zhou,Lin Lu,Hanchi Sun,Pan Zhou,Lichao Sun
2024-07-11
Abstract:Jailbreak attacks on large language models (LLMs) involve inducing these models to generate harmful content that violates ethics or laws, posing a significant threat to LLM security. Current jailbreak attacks face two main challenges: low success rates due to defensive measures and high resource requirements for crafting specific prompts. This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Virtual Context addresses these challenges by significantly increasing the success rates of existing jailbreak methods and requiring minimal background knowledge about the target model, thus enhancing effectiveness in black-box settings without additional overhead. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40% across various LLMs. Additionally, applying Virtual Context to original malicious behaviors still achieves a notable jailbreak effect. In summary, our research highlights the potential of special tokens in jailbreak attacks and recommends including this threat in red-teaming testing to comprehensively enhance LLM security.
Cryptography and Security
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: how to increase the success rate of jailbreak attacks on large - language models (LLMs) while reducing resource consumption. Specifically, the paper proposes a new method named Virtual Context. By injecting special tokens, it deceives the LLM into mistaking user input as content generated by itself, thereby significantly improving the success rate of existing jailbreak attacks, and in a black - box environment, it does not require additional background knowledge or computational resources. ### Problem Background Jailbreak attacks refer to carefully constructing malicious prompts to make LLMs generate content that violates ethics or laws, which poses a significant threat to the security of LLMs. Current jailbreak attacks face two main challenges: 1. **Low success rate**: Due to the existence of defense measures, the success rate of existing jailbreak attacks is low. 2. **High resource requirements**: In order to construct specific malicious prompts, a large amount of computational resources and optimization iterations are required. ### Solution The paper proposes the Virtual Context method, which uses special tokens (such as `<SEP>`) to enhance the effect of jailbreak attacks. The main contributions of Virtual Context include: - **Reducing resource consumption**: Unlike gradient - based optimization methods, Virtual Context can improve the jailbreak success rate with only a small amount of resources. - **Enhancing generalization ability**: Traditional adversarial suffixes are highly specific, while Virtual Context shows strong generalization ability in various scenarios. - **Improving readability**: Virtual Context completely depends on coherent natural language. Except for the special tokens themselves, it ensures that jailbreak attacks maintain high coherence and effectively bypass defense mechanisms based on semantic consistency. ### Experimental Results Experiments show that the jailbreak attack method assisted by Virtual Context significantly increases the success rate by about 40% on multiple LLMs, and also achieves significant results when directly applied to the original malicious behavior. In addition, Virtual Context also demonstrates its wide applicability under different generation configurations, verifying its high efficiency and universality. ### Summary By introducing the Virtual Context method, the paper solves the problems of low success rate and high resource consumption in existing jailbreak attacks, providing new ideas and tools for improving the security of LLMs. At the same time, the research emphasizes that this threat should be considered in red - team testing to comprehensively enhance the security of LLMs.