Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Yuxi Li, Zhibo Zhang, Kailong Wang, Ling Shi, Haoyu Wang
2024-12-11
Abstract: Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM framework, our method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs, maintaining high performance. Unlike existing methods, D-LLM eliminates the need for specific triggers or harmful response collections, offering a stealthier and more effective jailbreak strategy. This work reveals a covert and robust threat vector in LLM security and emphasizes the need for stronger safeguards in model safety alignment.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses a security weakness of large language models (LLMs): their susceptibility to jailbreak attacks. Existing jailbreak techniques are effective, but they usually rely on modifying the input, which makes them easy to detect and limits their stealth and scalability. The paper therefore proposes a new white-box method, Targeted Model Editing (TME), which bypasses safety filters by minimally changing the model's internal structure while preserving its intended functionality.

### Main problem summary:

1. **Limitations of existing jailbreak techniques**: Current jailbreak techniques rely on input modification (such as adding prefixes or inserting trigger words), which makes the attacks easy to detect and reduces their stealth and effectiveness.
2. **Need for a stealthier attack method**: Improving stealth and effectiveness requires a method that leaves the user input and prompt structure untouched and operates directly inside the model.
3. **Bypassing safety mechanisms without significantly degrading model performance**: The attack must bypass the safety mechanisms while leaving the model's normal functions intact.

### Solutions proposed in the paper:

- **Targeted Model Editing (TME)**: By analyzing the different activation patterns between safe and unsafe queries, TME identifies and removes the Safety-Critical Transformations (SCTs) embedded in the model matrices. It isolates and approximates the SCTs through an optimization process, so that the model's internal structure is modified only minimally.
- **D-LLM framework**: TME is integrated into an automated jailbreak framework, so that the edited model responds directly to malicious queries without any further input modification.

The method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs, with no significant performance degradation on standard tasks.

### Key contributions:

1. **Reveal new attack vectors**: The paper shows that safety-aligned LLMs can be bypassed without collecting harmful responses, using trigger words, or modifying the input.
2. **Empirical research and mechanism isolation**: An empirical study finds significant differences in activation patterns between safe and unsafe queries, and the SCTs are successfully isolated.
3. **Optimization to achieve effective jailbreak**: The SCTs are abstracted by solving an optimization problem that approximates a difference matrix, which enables the jailbreak without affecting overall performance.
4. **High attack success rate and function retention**: An average ASR of 84.86% is achieved on four open-source LLMs, and model performance is maintained on standard benchmarks.

In conclusion, this paper reveals a hidden threat to LLMs through a novel white-box attack method and emphasizes the importance of strengthening the safety alignment of these models.
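
The headline result is an ASR averaged over several models. As a minimal sketch of how such a figure is commonly computed, the snippet below assumes that ASR is the percentage of attack attempts judged successful and that the average is an unweighted mean across models; the summary does not state the exact averaging procedure. The model names and counts are hypothetical placeholders, not data from the paper.

```python
from statistics import mean


def attack_success_rate(successes: int, attempts: int) -> float:
    """Return the ASR as a percentage of attempts judged successful."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie between 0 and attempts")
    return 100.0 * successes / attempts


def average_asr(per_model: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-model ASR values (an assumption, see lead-in)."""
    return mean(attack_success_rate(s, n) for s, n in per_model.values())


# Hypothetical (successes, attempts) per model; placeholder values only.
hypothetical_results = {
    "model_a": (90, 100),
    "model_b": (80, 100),
    "model_c": (85, 100),
    "model_d": (75, 100),
}

print(f"{average_asr(hypothetical_results):.2f}%")  # prints "82.50%"
```

An unweighted mean treats every model equally regardless of how many queries were evaluated on it; if the per-model query counts differed, a pooled rate (total successes over total attempts) would give a different number.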