You Know What I'm Saying: Jailbreak Attack via Implicit Reference

Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo
2024-10-08
Abstract: While recent advancements in large language model (LLM) alignment have enabled the effective identification of malicious objectives involving scene nesting and keyword rewriting, our study reveals that these methods remain inadequate at detecting malicious objectives expressed through context within nested harmless objectives. This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. This method employs multiple related harmless objectives to generate malicious content without triggering refusal responses, thereby effectively bypassing existing detection techniques. Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B. Notably, we observe an inverse scaling phenomenon, where larger models are more vulnerable to this attack method. These findings underscore the urgent need for defense mechanisms capable of understanding and preventing contextual attacks. Furthermore, we introduce a cross-model attack strategy that leverages less secure models to generate malicious contexts, thereby further increasing the ASR when targeting other models. Our code and jailbreak artifacts can be found at https://github.com/Lucas-TY/llm_Implicit_reference.
Computation and Language
What problem does this paper attempt to address?
The paper addresses a gap in LLM safety alignment: models can still be led to generate malicious content when the malicious objective is never stated directly. Current safety mechanisms, such as supervised fine-tuning and reinforcement learning, can identify many malicious requests, but they fail to detect malicious intent that is nested within benign objectives and expressed only through implicit references. To expose this gap, the paper proposes a new attack method, **Attack via Implicit Reference (AIR)**, which decomposes a malicious objective into multiple benign objectives and connects them through implicit references in the context, thereby bypassing existing detection techniques.

The main contributions of the paper are:

1. **Proposing the implicit reference attack**: AIR exploits the in-context learning ability of LLMs to generate malicious content from objectives that each appear harmless.
2. **Discovering an inverse scaling phenomenon**: The experiments show that larger models are more vulnerable to this attack method.
3. **Introducing a cross-model attack strategy**: A less secure model generates the initial malicious context, and the attack then continues on a more secure target model, which further increases the attack success rate.

Experimental results show that AIR achieves an attack success rate (ASR) exceeding 90% on most of the state-of-the-art LLMs tested, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B, and that the cross-model strategy raises the ASR further. Existing defenses such as SmoothLLM, PerplexityFilter, and Erase-and-Check perform poorly against AIR, which highlights the need for new defense mechanisms that can understand and prevent such contextual attacks.