Abstract: The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we first propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model's performance across standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.

What problem does this paper attempt to address?

The paper addresses the vulnerability of large language models (LLMs) to covert jailbreak attacks in multi-turn conversations, which can lead them to generate harmful or illegal content. It points out that current jailbreak attack methods mostly focus on single-turn conversations in which the malicious intent is stated explicitly, which does not match how real-world interactions unfold. In practice, users can hide their true intentions across multiple turns and carry out jailbreak attacks more covertly. The paper therefore proposes a new multi-turn jailbreak attack method, RED QUEEN ATTACK, which conceals malicious intent by constructing multi-turn conversation scenarios, making the attack both more covert and more effective.

### Main Research Questions

1. **How effective is RED QUEEN ATTACK across different LLM families?**
2. **What factors contribute to the success of RED QUEEN ATTACK?**
3. **How does RED QUEEN ATTACK perform across different scenarios and harmful behavior categories?**
4. **What do LLMs output when RED QUEEN ATTACK succeeds or fails?**

### Research Methods

- **Dataset Construction**: The paper constructs a dataset of 56,000 multi-turn attack data points, covering 14 harmful categories and 40 scenarios based on different occupations and relationships.
- **Experimental Setup**: Ten models from four representative LLM families are evaluated, with model sizes ranging from 7B to 405B.
- **Evaluation Metrics**: The main metric is the attack success rate (ASR), that is, the proportion of attacks that elicit a harmful output.

### Main Findings

1. **Overall Attack Success Rate**: RED QUEEN ATTACK achieves a high attack success rate on all tested models, reaching 87.62% ASR on GPT-4o and 75.40% ASR on Llama3-70B.
2. **Key Success Factors**:
   - **Multi-turn Structure and Concealment**: Concealment alone is already very effective, and combining it with the multi-turn structure raises the attack success rate further.
   - **Number of Turns**: Increasing the number of conversation turns usually improves the attack success rate, especially for models from 8B to 70B. Five-turn scenarios perform best on most models.
   - **Model Size**: Larger models are more vulnerable to RED QUEEN ATTACK. This may be because they are more capable of complex reasoning and planning, and so are also more easily misled into generating harmful plans.
3. **Performance in Different Scenarios and Harmful Behavior Categories**: Among occupation-based scenarios, the detective and police scenarios have the highest attack success rates, while the lawyer and teacher scenarios have relatively low ones. Some occupation scenarios perform particularly well on specific models. For example, the priest scenario on Mixtral-22b performs comparably to the detective and police scenarios.

### Conclusion

By proposing RED QUEEN ATTACK and RED QUEEN GUARD, the paper both reveals important security vulnerabilities of current LLMs in multi-turn conversations and provides an effective mitigation strategy. These findings emphasize the importance of more comprehensive security testing in multi-turn conversation scenarios to ensure the security of LLMs in practical applications.
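The attack success rate (ASR) used as the evaluation metric is a simple proportion, and the per-model and per-turn comparisons in the findings amount to computing that proportion within groups. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the `AttackResult` record, its field names, and the helper functions are hypothetical, and the `harmful` flag is assumed to come from some external judge that the summary does not describe.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Iterable, List


@dataclass(frozen=True)
class AttackResult:
    """One evaluated attack attempt (hypothetical record layout)."""
    model: str     # name of the model under test
    turns: int     # number of conversation turns in the scenario
    harmful: bool  # assumed verdict from an external judge


def attack_success_rate(results: Iterable[AttackResult]) -> float:
    """ASR = (# attempts judged harmful) / (# attempts)."""
    items: List[AttackResult] = list(results)
    if not items:
        raise ValueError("ASR is undefined for an empty result set")
    return sum(r.harmful for r in items) / len(items)


def asr_by(
    results: Iterable[AttackResult],
    key: Callable[[AttackResult], Hashable],
) -> Dict[Hashable, float]:
    """Group attempts by `key` and compute the ASR within each group."""
    groups: Dict[Hashable, List[AttackResult]] = defaultdict(list)
    for r in results:
        groups[key(r)].append(r)
    return {k: attack_success_rate(v) for k, v in groups.items()}


if __name__ == "__main__":
    # Toy records for illustration only; these are not results from the paper.
    demo = [
        AttackResult("model-a", 3, True),
        AttackResult("model-a", 5, True),
        AttackResult("model-b", 3, False),
        AttackResult("model-b", 5, True),
    ]
    print(attack_success_rate(demo))
    print(asr_by(demo, key=lambda r: r.model))
    print(asr_by(demo, key=lambda r: r.turns))
```

Grouping by a key function keeps the same helper usable for any breakdown (model, turn count, scenario, or harmful category) without changing the metric itself.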

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
