Abstract: Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. Existing attack methods are fixed or specifically tailored to certain models and cannot flexibly adjust attack strength, which is critical for generalization when attacking models of various sizes. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources. Our method engages the LLM in a resource-intensive preliminary task (a Character Map lookup and decoding process) before presenting the target instruction. By saturating the model's processing capacity, we prevent the activation of safety protocols when it processes the subsequent instruction. Extensive experiments on state-of-the-art LLMs demonstrate that our method achieves a high success rate in bypassing safety measures without requiring gradient access or manual prompt engineering. We verify that our approach offers a scalable attack that quantifies attack strength and adapts to different model scales at the optimal strength. We show that the safety policies of LLMs might be more susceptible to resource constraints. Our findings reveal a critical vulnerability in current LLM safety designs, highlighting the need for more robust defense strategies that account for resource-intensive conditions.

What problem does this paper attempt to address?

This paper addresses jailbreak attacks on large language models (LLMs). Specifically, the authors study how occupying a model's computational resources can bypass its safety mechanisms, yielding a scalable jailbreak attack. The core problems and goals of the paper are:

1. **Limitations of existing attack methods**:
   - Existing attack methods are usually fixed or tailored to specific models and cannot flexibly adjust attack strength.
   - As a result, they lack generality and flexibility when attacking models of different scales.
2. **Proposed new method**:
   - The authors introduce a scalable jailbreak attack that prevents the activation of the LLM's safety policies by preoccupying its computational resources.
   - The method has the LLM perform a resource-intensive preliminary task, a Character Map lookup and decoding process, before the target instruction is presented. Saturating the model's processing capacity prevents the safety protocols from activating while the subsequent instruction is processed.
3. **Experimental verification**:
   - Extensive experiments on state-of-the-art LLMs show that the method bypasses safety measures with a high success rate, without gradient access or manual prompt engineering.
   - The experiments also show that the method quantifies attack strength and adapts to models of different scales by operating at the optimal strength for each.
4. **Revealing key vulnerabilities**:
   - The study finds that the safety policies of LLMs may be more vulnerable under resource constraints.
   - This reveals a critical vulnerability in current LLM safety designs and underlines the need for more robust defense strategies that account for resource-intensive conditions.
5. **Research significance**:
   - The research shows how limits on computational resources can be exploited for jailbreak attacks, and it implies that future LLM safety designs need to take resource management into account.

Through this work, the authors hope to promote a deeper understanding of LLM safety and to facilitate the development of stronger defense strategies.
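The claim that attack strength can be quantified, and that an optimal strength exists for each model scale, can be made concrete from the evaluator's side. The following is a minimal sketch of how a red-team or defense evaluation might tabulate success rate per strength level and locate the peak. It is not the authors' evaluation code: the function names, the integer `strength` scale, and the trial records are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict


def success_rate_by_strength(trials):
    """Return {strength: fraction of trials judged successful}.

    `trials` is an iterable of (strength, succeeded) pairs, where
    `strength` is whatever scalar the evaluation uses to quantify
    the load of the preliminary task (an assumption of this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # strength -> [successes, total]
    for strength, succeeded in trials:
        counts[strength][0] += int(succeeded)
        counts[strength][1] += 1
    return {s: ok / total for s, (ok, total) in sorted(counts.items())}


def best_strength(rates):
    """Pick the strength with the highest rate; ties go to the smaller strength."""
    return min(rates, key=lambda s: (-rates[s], s))


# Hypothetical placeholder records, NOT data from the paper.
trials = [
    (1, False), (1, False),
    (2, True), (2, False),
    (3, True), (3, True),
    (4, True), (4, False),
]

rates = success_rate_by_strength(trials)
print(rates)                 # per-strength success rates
print(best_strength(rates))  # strength at which the rate peaks
```

Running the same tabulation separately for each model under test is one way to check whether the peak moves with model scale, which is the behavior the summary describes.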

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
