Abstract: Jailbreak attacks on large language models (LLMs) involve inducing these models to generate harmful content that violates ethics or laws, posing a significant threat to LLM security. Current jailbreak attacks face two main challenges: low success rates due to defensive measures and high resource requirements for crafting specific prompts. This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Virtual Context addresses these challenges by significantly increasing the success rates of existing jailbreak methods and requiring minimal background knowledge about the target model, thus enhancing effectiveness in black-box settings without additional overhead. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40% across various LLMs. Additionally, applying Virtual Context to original malicious behaviors still achieves a notable jailbreak effect. In summary, our research highlights the potential of special tokens in jailbreak attacks and recommends including this threat in red-teaming testing to comprehensively enhance LLM security.

What problem does this paper attempt to address?

The problem this paper addresses is how to increase the success rate of jailbreak attacks on large language models (LLMs) while reducing resource consumption. Specifically, the paper proposes a method called Virtual Context. By injecting special tokens, it tricks the LLM into treating part of the user input as content the model generated itself. This substantially raises the success rate of existing jailbreak attacks and, in black-box settings, requires little background knowledge about the target model and no additional computational resources.

### Problem Background

Jailbreak attacks carefully construct malicious prompts that make LLMs generate content violating ethics or laws, which poses a significant threat to LLM security. Current jailbreak attacks face two main challenges:

1. **Low success rate**: Defensive measures keep the success rate of existing jailbreak attacks low.
2. **High resource requirements**: Constructing specific malicious prompts requires substantial computational resources and many optimization iterations.

### Solution

The paper proposes Virtual Context, which uses special tokens (such as `<SEP>`) to strengthen jailbreak attacks. Its main contributions are:

- **Reducing resource consumption**: Unlike gradient-based optimization methods, Virtual Context improves jailbreak success rates with only modest resources.
- **Enhancing generalization**: Traditional adversarial suffixes are highly specific, whereas Virtual Context generalizes well across diverse scenarios.
- **Improving readability**: Apart from the special tokens themselves, Virtual Context relies entirely on coherent natural language, so the resulting jailbreak prompts stay highly coherent and can evade defense mechanisms based on semantic consistency.

### Experimental Results

Experiments show that Virtual Context-assisted jailbreak methods raise success rates by about 40% across multiple LLMs, and that applying Virtual Context directly to the original malicious behaviors also produces a notable jailbreak effect. Virtual Context additionally remains effective under different generation configurations, supporting its efficiency and broad applicability.

### Summary

By introducing Virtual Context, the paper addresses the low success rates and high resource costs of existing jailbreak attacks, offering new ideas and tools for assessing and improving LLM security. The authors also stress that this threat should be included in red-teaming to comprehensively strengthen LLM security.

Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
