Abstract: The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications, yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn and use explicit malicious queries, so they do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions more covertly. To bridge this gap, we first propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario that conceals the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in number of turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, which reaches an attack success rate of 87.62% on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model's performance across standard benchmarks. The full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations

Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications

ThreatModeling-LLM: Automating Threat Modeling using Large Language Models for Banking System

Red Teaming Language Model Detectors with Language Models

A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models

Adversarial attacks and defenses for large language models (LLMs): methods, frameworks & challenges

LLMs Killed the Script Kiddie: How Agents Supported by Large Language Models Change the Landscape of Network Threat Testing

Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Learning diverse attacks on large language models for robust red-teaming and safety tuning

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks

Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Exploring Vulnerabilities and Protections in Large Language Models: A Survey

Attack Prompt Generation for Red Teaming and Defending Large Language Models

The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks