You Know What I'm Saying: Jailbreak Attack via Implicit Reference

Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo

2024-10-08

Abstract: While recent advancements in large language model (LLM) alignment have enabled the effective identification of malicious objectives involving scene nesting and keyword rewriting, our study reveals that these methods remain inadequate at detecting malicious objectives expressed through context within nested harmless objectives. This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. This method employs multiple related harmless objectives to generate malicious content without triggering refusal responses, thereby effectively bypassing existing detection techniques. Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B. Notably, we observe an inverse scaling phenomenon, where larger models are more vulnerable to this attack method. These findings underscore the urgent need for defense mechanisms capable of understanding and preventing contextual attacks. Furthermore, we introduce a cross-model attack strategy that leverages less secure models to generate malicious contexts, thereby further increasing the ASR when targeting other models. Our code and jailbreak artifacts can be found at https://github.com/Lucas-TY/llm_Implicit_reference.

Computation and Language

What problem does this paper attempt to address?

The paper addresses a weakness in the safety alignment of large language models (LLMs): they can still be steered into generating malicious content. Although current safety mechanisms such as supervised fine-tuning and reinforcement learning can identify some malicious requests, they fail to detect malicious intent that is nested within harmless objectives and expressed only through implicit references. The paper proposes a new attack method, **Attack via Implicit Reference (AIR)**, which decomposes a malicious objective into multiple harmless objectives and links them through implicit references in the context, thereby bypassing existing detection techniques.

The main contributions of the paper are:

1. **Proposing the implicit reference attack**: Exploiting the in-context learning ability of LLMs to generate malicious content.
2. **Discovering the inverse scaling phenomenon**: The experiments show that larger models are more vulnerable to this attack method.
3. **Cross-model attack strategy**: A less secure model generates the initial malicious context, and the attack then continues on a more secure model, which further increases the attack success rate.

Experimental results show that AIR achieves an attack success rate (ASR) of over 90% on most of the state-of-the-art LLMs tested, and that the cross-model strategy raises the ASR further. Moreover, existing detection methods such as SmoothLLM, PerplexityFilter, and Erase-and-Check perform poorly in defending against AIR, highlighting the urgent need for new defense mechanisms that can counter such contextual attacks.
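Since the results above are reported as attack success rate (ASR), a minimal sketch of how such a per-model figure can be tallied may help when reading them. It assumes each attempt has already been labeled by some external judge; the `JudgedAttempt` type, its field names, and the function name are hypothetical and are not taken from the paper or its repository, and the paper's own judging protocol is not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JudgedAttempt:
    """One evaluated attempt (hypothetical record layout, not the paper's)."""
    model: str            # name of the target model
    judged_harmful: bool  # verdict from an external judge (assumed to exist)


def attack_success_rate(attempts: Iterable[JudgedAttempt]) -> Dict[str, float]:
    """Return per-model ASR = attempts judged harmful / total attempts."""
    totals: Dict[str, int] = {}
    successes: Dict[str, int] = {}
    for attempt in attempts:
        totals[attempt.model] = totals.get(attempt.model, 0) + 1
        successes[attempt.model] = (
            successes.get(attempt.model, 0) + int(attempt.judged_harmful)
        )
    # Every model in `totals` has at least one attempt, so no division by zero.
    return {model: successes[model] / totals[model] for model in totals}


if __name__ == "__main__":
    # Placeholder labels, purely illustrative; not results from the paper.
    sample = [
        JudgedAttempt("model-a", True),
        JudgedAttempt("model-a", True),
        JudgedAttempt("model-a", False),
        JudgedAttempt("model-a", True),
        JudgedAttempt("model-b", False),
        JudgedAttempt("model-b", True),
    ]
    print(attack_success_rate(sample))
```

Under this definition, an ASR above 90% means that more than nine in ten attempts against a model were judged to have produced the targeted content, so the figure depends directly on how strict the judge is.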

Distract Large Language Models for Automatic Jailbreak Attack

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

Investigating Coverage Criteria in Large Language Models: An In-Depth Study Through Jailbreak Attacks

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

A Realistic Threat Model for Large Language Model Jailbreaks

Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks

SQL Injection Jailbreak: a structural disaster of large language models

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization