Abstract: As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during reasoning, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely-deployed AI systems.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the trade-off between safety and utility that large language models (LLMs) face in deployment. Specifically, it asks how to keep a model safe without compromising its responsiveness to benign requests.

#### Core Issues

1. **Jailbreak Attacks**:
   - Jailbreak attacks use malicious prompts (for example, requests for instructions on making explosives) to bypass the model's safety mechanisms and make it generate harmful content.
2. **Limitations of Existing Defense Methods**:
   - Existing defenses (such as prompt engineering and safety fine-tuning) often introduce additional computational overhead, increase inference latency, and lack runtime flexibility.
   - Overly restrictive safety measures can cause the model to refuse benign queries, degrading utility.

#### Solution

The paper proposes a method called "Jailbreak Antidote," which adjusts the model's safety preference in real time by shifting a sparse subset of its internal state (approximately 5%) along a safety direction during inference. The strength of the shift can be varied, which gives flexible control over the safety-utility balance without the token overhead or inference delays of traditional defenses.

#### Main Contributions

1. **Real-time Safety Adjustment**:
   - Adjusting specific components of the model's internal state changes its safety preference at inference time, with no additional token overhead, inference delay, or prompt modification.
2. **Balancing Safety and Utility**:
   - By adjusting internal representations with varying strengths, the study quantifies the trade-off between safety and utility.
3. **Comprehensive Validation**:
   - Extensive experiments cover nine LLMs (ranging from 2 billion to 72 billion parameters), ten jailbreak attack methods, and six defense strategies used for comparison.
   - The results show that this method significantly outperforms existing defense methods in balancing safety and utility.

In summary, the paper proposes a lightweight and adaptive solution to the challenges LLMs face in practical applications, improving safety while preserving utility.
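
The central mechanism described above, shifting a hidden state along a safety direction while touching only about 5% of its components, can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names are hypothetical, the direction is estimated as a normalized difference of mean hidden states (the paper may derive it differently), and sparsity is imposed by keeping the largest-magnitude components.

```python
import math

def safety_direction(benign_states, harmful_states):
    """Hypothetical estimate of a safety direction: the normalized difference
    between the mean hidden state on harmful prompts and on benign prompts.
    Assumption: the paper may compute its direction in another way."""
    dim = len(benign_states[0])

    def mean(rows):
        return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]

    diff = [h - b for h, b in zip(mean(harmful_states), mean(benign_states))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def sparsify(direction, keep_fraction=0.05):
    """Keep only the largest-magnitude components; zero out the rest."""
    k = max(1, round(keep_fraction * len(direction)))
    ranked = sorted(range(len(direction)),
                    key=lambda i: abs(direction[i]), reverse=True)
    keep = set(ranked[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(direction)]

def adjust_hidden_state(hidden, sparse_direction, alpha):
    """Shift the hidden state by alpha along the sparse direction.
    Larger alpha means a stronger push toward the 'harmful prompt' side,
    which in this sketch stands in for a stronger safety preference."""
    return [h + alpha * d for h, d in zip(hidden, sparse_direction)]

# Toy 4-dimensional example (made-up numbers, not data from the paper).
benign = [[0.0, 1.0, 2.0, 0.0], [0.0, 1.0, 2.0, 2.0]]
harmful = [[3.0, 1.0, 2.0, 1.0], [5.0, 1.0, 2.0, 2.0]]

direction = safety_direction(benign, harmful)
sparse = sparsify(direction, keep_fraction=0.25)   # keeps 1 of 4 components
shifted = adjust_hidden_state([0.5, 0.5, 0.5, 0.5], sparse, alpha=2.0)
print(sparse)
print(shifted)
```

In a real model the shift would be applied to hidden states inside the network during inference (for instance through a forward hook), and `alpha` would be the knob that trades safety against utility at runtime. The sketch only shows the arithmetic on a single vector.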

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
