Abstract:Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We develop two attacks to exploit the ChatBug vulnerability. We demonstrate that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the vulnerabilities in the safety alignment of large - language models (LLMs) after fine - tuning using chat templates. Specifically, the author discovers that although chat templates can effectively optimize the performance of LLMs, they introduce a common vulnerability named "ChatBug". This vulnerability enables malicious users to bypass the safety alignment mechanism of LLMs through carefully - designed prompts, thus triggering unsafe or unexpected responses. In the paper, the author not only reveals the existence of ChatBug but also shows how to exploit this vulnerability and explores possible countermeasures. ### Main research contents: 1. **Identification of the ChatBug vulnerability**: - The author points out that since chat templates define strict format requirements, and these formats only need to be followed by LLMs, while users are not restricted by this. Therefore, malicious users can take advantage of this point, bypass the safety alignment mechanism by constructing specific inputs, and trigger harmful responses. 2. **Development of attack methods**: - The author has developed two attack methods to exploit the ChatBug vulnerability: Format Mismatch Attack and Message Overflow Attack. These two attack methods respectively induce LLMs to generate harmful responses by modifying the chat format or inserting additional tokens in the message. 3. **Experimental evaluation**: - The author has conducted experiments on eight different LLMs, including open - source and closed - source models, and verified the effectiveness and universality of the ChatBug vulnerability. The results show that even models that have undergone strict safety alignment are also vulnerable to ChatBug. 4. **The enhancing effect of ChatBug on existing jailbreak attacks**: - The paper further shows that ChatBug can enhance existing jailbreak attacks (such as GCG, GPTFuzzer and ArtPrompt), significantly increasing the success rate of these attacks. 5. **Potential countermeasures**: - The author has explored several methods to mitigate the ChatBug vulnerability, including mitigation - based measures (such as self - reminder, safe decoding and adversarial training) and detection - based measures (such as keyword filtering and classifiers). The experimental results show that although these methods can reduce the impact of ChatBug to a certain extent, they will also lead to a significant decline in model performance. ### Conclusion: The paper emphasizes the severity and universality of the ChatBug vulnerability, as well as the importance of finding a balance between the safety alignment and functionality of LLMs. The author calls on the community to work together to develop new instruction - tuning methods to better meet this challenge.

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Exploring Backdoor Vulnerabilities of Chat Models

Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

Hidden in Plain Sight: Exploring Chat History Tampering in Interactive Language Models

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Universal and Transferable Adversarial Attacks on Aligned Language Models

Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots

What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models