Abstract: Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM framework, our method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs, maintaining high performance. Unlike existing methods, D-LLM eliminates the need for specific triggers or harmful response collections, offering a stealthier and more effective jailbreak strategy. This work reveals a covert and robust threat vector in LLM security and emphasizes the need for stronger safeguards in model safety alignment.

What problem does this paper attempt to address?

The problem this paper addresses is the security of large language models (LLMs), specifically their exposure to "jailbreak attacks". Although existing jailbreak techniques are effective, they usually rely on input modification, which makes them easy to detect and limits their stealth and scalability. The paper therefore proposes a new white-box method, Targeted Model Editing (TME), which bypasses safety filters by minimally changing the model's internal structure while preserving its intended functionality.

### Main problem summary:

1. **Limitations of existing jailbreak techniques**: Current jailbreak techniques rely on input modification (such as adding prefixes or inserting trigger words), which makes the attacks easy to detect and reduces their stealth and effectiveness.
2. **Need for a stealthier attack method**: Improving stealth and effectiveness requires a method that leaves the user input and prompt structure untouched and instead operates directly inside the model.
3. **How to bypass safety mechanisms without significantly degrading model performance**: The attack must bypass safety mechanisms while leaving the model's normal functions intact.

### Solutions proposed in the paper:

- **Targeted Model Editing (TME)**: By analyzing how activation patterns differ between safe and unsafe queries, TME identifies and removes the Safety-Critical Transformations (SCTs) embedded in the model's matrices. It isolates and approximates the SCTs through an optimization process, so that only a minimal change to the model's internal structure is needed.
- **D-LLM framework**: TME is integrated into an automated jailbreak framework, so that the edited model responds directly to malicious queries without any further input modification.

The method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs, with no significant performance degradation on standard tasks.

### Key contributions:

1. **Reveal new attack vectors**: The paper demonstrates that the safeguards of safety-aligned LLMs can be bypassed without collecting harmful responses, using trigger words, or modifying inputs.
2. **Empirical research and mechanism isolation**: An empirical study finds significant differences in activation patterns between safe and unsafe queries, and the SCTs are successfully isolated.
3. **Optimization to achieve effective jailbreak**: SCTs are extracted by solving an optimization problem that approximates the difference matrix, which enables the jailbreak without affecting overall performance.
4. **High attack success rate and function retention**: An average ASR of 84.86% is achieved on four open-source LLMs, and model performance is maintained on standard benchmark tests.

In conclusion, the paper uses a novel white-box attack to reveal a hidden threat in LLMs and emphasizes the importance of strengthening model safety alignment.
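As a reading aid for the reported metric, the following is a minimal sketch of how an average Attack Success Rate could be computed from per-query judgments. It assumes a macro-average (the mean of the per-model rates); the summary does not say whether the paper averages this way or pools all queries. The function names, model keys, and outcome lists are hypothetical placeholders, not data or code from the paper.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return the ASR in percent: the share of queries judged successful.

    `outcomes` is a list of booleans, one per evaluated query
    (True = the attack was judged successful for that query).
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one judgment")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_asr(per_model_outcomes):
    """Macro-average: mean of the per-model ASR values (an assumption)."""
    return mean(attack_success_rate(o) for o in per_model_outcomes.values())


# Hypothetical placeholder judgments, not results from the paper.
example = {
    "model_a": [True, True, False, True],
    "model_b": [True, False],
}
print(average_asr(example))
```

A pooled ASR (all queries from all models in one list) would generally give a different number when the models are evaluated on different numbers of queries, so the averaging convention matters when comparing reported figures.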

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

A Realistic Threat Model for Large Language Model Jailbreaks

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

Jailbreaking Black Box Large Language Models in Twenty Queries

Comprehensive Assessment of Jailbreak Attacks Against LLMs

Weak-to-Strong Jailbreaking on Large Language Models

JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters