Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang He, Yi Zeng
2024-10-07
Abstract: As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during reasoning, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely-deployed AI systems.
Cryptography and Security, Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the balance between safety and utility that large language models (LLMs) face during deployment. Specifically, it asks how to keep a model safe without compromising its responsiveness to benign requests.

#### Core Issues

1. **Jailbreak Attacks**:
   - Jailbreak attacks use malicious prompts (for example, requests for instructions on making explosives) to make the model generate harmful content, bypassing its safety mechanisms.
2. **Limitations of Existing Defense Methods**:
   - Existing defense strategies, such as prompt engineering and safety fine-tuning, often introduce additional computational overhead, increase inference latency, and lack runtime flexibility.
   - Overly strict defenses may cause the model to refuse benign requests, degrading the user experience.

#### Solution

The paper proposes a method called "Jailbreak Antidote," which adjusts the model's safety preference in real time by shifting a sparse subset (approximately 5%) of the model's internal state along a safety direction during inference. According to the paper, adjusting this subset is as effective as modifying the entire state. The method avoids the overhead and inflexibility of traditional defenses and allows the safety-utility balance to be tuned at runtime.

#### Main Contributions

1. **Real-time Safety Adjustment**:
   - By adjusting specific components of the model's internal state, the safety preference can be changed at runtime without additional token overhead, inference delay, or prompt modification.
2. **Balancing Safety and Utility**:
   - By varying the strength of the adjustment to the internal representations, the study quantifies the trade-off between safety and utility.
3. **Comprehensive Validation**:
   - Extensive experiments were conducted on nine LLMs ranging from 2 billion to 72 billion parameters, evaluated against ten jailbreak attack methods and compared with six defense strategies.
   - The results show that the method significantly outperforms existing defense methods in balancing safety and utility.

In summary, the paper proposes a lightweight and adaptive solution that preserves both the safety and the utility of large language models in practical applications.
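
The adjustment described under Solution can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how the safety direction is estimated, so a normalized difference of mean hidden states is swapped in here, and the function names (`safety_direction`, `sparse_mask`, `adjust`), the top-magnitude selection rule, and the toy data are all hypothetical.

```python
import numpy as np


def safety_direction(benign_states, harmful_states):
    """Assumed estimator: unit vector from the benign mean to the harmful mean.

    Both inputs have shape (num_prompts, hidden_dim).
    """
    diff = harmful_states.mean(axis=0) - benign_states.mean(axis=0)
    return diff / np.linalg.norm(diff)


def sparse_mask(direction, keep_fraction=0.05):
    """Binary mask keeping the largest-magnitude coordinates of the direction.

    keep_fraction=0.05 mirrors the "approximately 5%" figure from the abstract;
    selecting by magnitude is an assumption of this sketch.
    """
    k = max(1, int(round(keep_fraction * direction.size)))
    top_indices = np.argsort(np.abs(direction))[-k:]
    mask = np.zeros_like(direction)
    mask[top_indices] = 1.0
    return mask


def adjust(hidden, direction, mask, alpha):
    """Shift a hidden state along the masked direction with strength alpha."""
    return hidden + alpha * mask * direction


# Toy data standing in for hidden states collected from a real model.
rng = np.random.default_rng(0)
hidden_dim = 100
benign = rng.normal(size=(32, hidden_dim))
harmful = rng.normal(size=(32, hidden_dim)) + rng.normal(size=hidden_dim)

direction = safety_direction(benign, harmful)
mask = sparse_mask(direction, keep_fraction=0.05)

hidden = rng.normal(size=hidden_dim)
shifted = adjust(hidden, direction, mask, alpha=2.0)
print("coordinates changed:", np.count_nonzero(shifted - hidden))
```

Here `alpha` plays the role of the runtime strength setting: `alpha=0` leaves the hidden state untouched, and larger values push it further along the safety direction. Because the shift is a single masked vector addition per hidden state, it adds no tokens to the prompt, which is consistent with the low-overhead claim in the abstract. In a real model the same addition would have to be applied to intermediate activations during the forward pass; how and at which layers is not specified in this summary.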