Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Yiting Dong, Guobin Shen, Dongcheng Zhao, Xiang He, Yi Zeng
2024-10-05
Abstract: Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. Existing attack methods are fixed or specifically tailored for certain models and cannot flexibly adjust attack strength, which is critical for generalization when attacking models of various sizes. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources. Our method engages the LLM in a resource-intensive preliminary task (a Character Map lookup and decoding process) before presenting the target instruction. By saturating the model's processing capacity, we prevent the activation of safety protocols when it processes the subsequent instruction. Extensive experiments on state-of-the-art LLMs demonstrate that our method achieves a high success rate in bypassing safety measures without requiring gradient access or manual prompt engineering. We verify that our approach offers a scalable attack that quantifies attack strength and adapts to different model scales at the optimal strength. We show that the safety policies of LLMs might be more susceptible to resource constraints. Our findings reveal a critical vulnerability in current LLM safety designs, highlighting the need for more robust defense strategies that account for resource-intensive conditions.
Cryptography and Security, Computation and Language
What problem does this paper attempt to address?
This paper studies jailbreak attacks on large language models (LLMs). Specifically, the authors investigate how to bypass an LLM's safety mechanisms by occupying its computational resources, yielding a jailbreak attack whose strength can be scaled. The core problems and goals of the paper are as follows:

1. **Limitations of existing attack methods**:
   - Existing attack methods are usually fixed or tailored to specific models and cannot flexibly adjust attack strength.
   - As a result, they generalize poorly when attacking models of different sizes.
2. **Proposed new method**:
   - The authors introduce a scalable jailbreak attack that preempts the activation of the LLM's safety policies by occupying its computational resources.
   - The method makes the LLM perform a resource-intensive preliminary task (a Character Map lookup and decoding process) before presenting the target instruction. Saturating the model's processing capacity prevents the safety protocols from activating when the subsequent instruction is processed.
3. **Experimental verification**:
   - Extensive experiments on state-of-the-art LLMs show that the method bypasses safety measures with a high success rate, without requiring gradient access or manual prompt engineering.
   - The experiments also show that the method quantifies attack strength and adapts to models of different scales at the optimal strength.
4. **Revealing key vulnerabilities**:
   - The study finds that the safety policies of LLMs may be more susceptible to resource constraints.
   - This finding reveals a critical vulnerability in current LLM safety designs and underlines the need for more robust defense strategies that account for resource-intensive conditions.
5. **Research significance**:
   - The research shows how limits on computational resources can be exploited to carry out jailbreak attacks, and it implies that future LLM safety designs need to take resource management into account.

Through this work, the authors hope to promote a deeper understanding of LLM safety and to facilitate the development of stronger defense strategies.