RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, Subhabrata Mukherjee
2024-09-26
Abstract: The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we, first, propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model's performance across standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the vulnerability of large language models (LLMs) to concealed jailbreak attacks carried out over multi-turn conversations, which can lead them to generate harmful or illegal content. It points out that current jailbreak methods mostly target single-turn conversations in which the malicious intent is stated explicitly, which does not match how real-world interactions unfold. In practice, users can hide their true intent across several turns and jailbreak a model more covertly. The paper therefore proposes a new multi-turn jailbreak method, RED QUEEN ATTACK, which conceals malicious intent by constructing multi-turn conversation scenarios, making the attack more covert and more effective.

### Main Research Questions

1. **How effective is RED QUEEN ATTACK across different LLM families?**
2. **What factors contribute to the success of RED QUEEN ATTACK?**
3. **How does RED QUEEN ATTACK perform across different scenarios and harmful behavior categories?**
4. **What do LLM outputs look like when RED QUEEN ATTACK succeeds or fails?**

### Research Methods

- **Dataset Construction**: The paper constructs a dataset of 56,000 multi-turn attack data points covering 14 harmful categories and 40 scenarios built around different occupations and relationships.
- **Experimental Setup**: Ten models from four representative LLM families are evaluated, with model sizes ranging from 7B to 405B.
- **Evaluation Metrics**: The main metric is the attack success rate (ASR), the proportion of attacks that successfully elicit harmful output.

### Main Findings

1. **Overall Attack Success Rate**: RED QUEEN ATTACK achieves a high attack success rate on all tested models, reaching 87.62% ASR on GPT-4o and 75.40% on Llama3-70B.
2. **Key Success Factors**:
   - **Multi-turn Structure and Concealment**: Combining the multi-turn conversation structure with concealment significantly improves the attack success rate. Concealment alone is already very effective, and adding the multi-turn structure increases the success rate further.
   - **Number of Turns**: Increasing the number of conversation turns usually improves the attack success rate, especially for models from 8B to 70B. Five-turn scenarios perform best on most models.
   - **Model Size**: Larger models are more vulnerable to RED QUEEN ATTACK, possibly because they are more capable at complex reasoning and planning and are therefore more easily misled into generating harmful plans.
3. **Performance in Different Scenarios and Harmful Behavior Categories**: Among occupation-based scenarios, the detective and police scenarios have the highest attack success rates, while the lawyer and teacher scenarios have relatively low ones. Some occupation scenarios are particularly effective on specific models. For example, the priest scenario on Mixtral-22b performs comparably to the detective and police scenarios.

### Conclusion

By proposing RED QUEEN ATTACK and RED QUEEN GUARD, the paper both reveals significant safety vulnerabilities of current LLMs in multi-turn conversations and provides an effective mitigation strategy. These findings underline the importance of more comprehensive safety testing in multi-turn conversation scenarios to keep LLMs safe in practical applications.
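
### Illustrative Sketch: Computing ASR

The attack success rate used as the evaluation metric above is a simple proportion, and the per-model and per-turn comparisons in the findings amount to computing that proportion within groups. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the record fields (`model`, `turns`, `harmful`), the function names, and the toy data are all hypothetical, and the judgement of whether an output is harmful is assumed to come from a separate step not shown here.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Mapping


def attack_success_rate(judgements: Iterable[bool]) -> float:
    """Return the fraction of attack attempts judged to have produced harmful output."""
    results = [bool(j) for j in judgements]
    if not results:
        raise ValueError("no attack attempts to score")
    return sum(results) / len(results)


def asr_by_group(records: Iterable[Mapping], key: str) -> Dict[Hashable, float]:
    """Compute ASR separately for each distinct value of `key` (e.g. model or turn count)."""
    grouped: Dict[Hashable, List[bool]] = defaultdict(list)
    for record in records:
        grouped[record[key]].append(bool(record["harmful"]))
    return {group: attack_success_rate(values) for group, values in grouped.items()}


# Hypothetical toy records; these are NOT results from the paper.
records = [
    {"model": "model-a", "turns": 3, "harmful": True},
    {"model": "model-a", "turns": 5, "harmful": True},
    {"model": "model-a", "turns": 5, "harmful": False},
    {"model": "model-b", "turns": 3, "harmful": False},
]

overall = attack_success_rate(r["harmful"] for r in records)
per_model = asr_by_group(records, "model")
per_turns = asr_by_group(records, "turns")
```

Grouping by a single key keeps the sketch short. Comparing, say, turn counts within one model would only require filtering the records before calling `asr_by_group`.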