On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li

2024-10-18

Abstract: Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or components are suppressed, the safety capability of LLMs is compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we explore the connection between standard attention mechanisms and safety capability to fill this gap in safety-related mechanistic interpretability. We propose a novel metric tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess individual heads' contributions to model safety. Based on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that certain attention heads have a significant impact on safety. Ablating a single safety head allows an aligned model (e.g., Llama-2-7b-chat) to respond to 16 times more harmful queries while modifying only 0.006% of the parameters, in contrast to the ~5% modification required in previous studies. More importantly, through comprehensive experiments we demonstrate that attention heads primarily function as feature extractors for safety, and that models fine-tuned from the same base model exhibit overlapping safety heads. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms within large models.

Computation and Language, Artificial Intelligence, Cryptography and Security, Machine Learning

What problem does this paper attempt to address?

This paper addresses a gap in research on the safety mechanisms of large language models (LLMs): the impact of multi-head attention on model safety has not been fully studied. Existing work shows that suppressing safety representations or components compromises the safety capability of LLMs, but it often overlooks the role that multi-head attention plays in safety. The paper therefore explores the connection between the standard attention mechanism and safety, with the goal of improving the understanding of the internal safety mechanisms of large models. Specifically, the authors propose a new metric, the Safety Head ImPortant Score (Ships), to evaluate the contribution of a single attention head to model safety. They then extend Ships to the dataset level and introduce the Safety Attention Head AttRibution Algorithm (Sahara) to identify the critical safety attention heads inside the model. In their experiments, the authors find that specific attention heads have a significant impact on safety: ablating a single safety head, which modifies only about 0.006% of the parameters, substantially weakens the model's safety behavior. They also find that models fine-tuned from the same base model show overlapping safety attention heads, which suggests that the base model's influence on safety is also important. These findings provide a new perspective for unpacking the black box of safety mechanisms in large models.
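The head-ablation idea described above can be illustrated with a small sketch. This is not the paper's Ships definition, since the summary does not give the formula. It is a toy, under stated assumptions: the "model" is an additive sum of per-head logit contributions, ablation is modeled by scaling one head's contribution by a small factor, and the importance score is the KL divergence between the original and ablated next-token distributions. All names (`softmax`, `kl_divergence`, `model_logits`, `head_importance`) and the example numbers are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def model_logits(head_outputs, scale=None):
    """Toy model (assumption): logits are the sum of per-head contributions.

    `scale` maps head index -> multiplier; a tiny multiplier mimics ablation.
    """
    scale = scale or {}
    vocab = len(head_outputs[0])
    return [
        sum(scale.get(h, 1.0) * out[v] for h, out in enumerate(head_outputs))
        for v in range(vocab)
    ]

def head_importance(head_outputs, epsilon=1e-5):
    """Score each head by the distribution shift that ablating it causes."""
    base = softmax(model_logits(head_outputs))
    scores = []
    for h in range(len(head_outputs)):
        ablated = softmax(model_logits(head_outputs, {h: epsilon}))
        scores.append(kl_divergence(base, ablated))
    return scores

# Hypothetical per-head contributions over a 3-token vocabulary;
# token 0 stands in for a refusal token.
heads = [
    [4.0, 0.0, 0.0],  # head 0 pushes strongly toward token 0
    [0.1, 0.2, 0.1],  # head 1 contributes little
    [0.0, 0.5, 0.3],  # head 2 mildly favors the other tokens
]
scores = head_importance(heads)
top_head = max(range(len(scores)), key=scores.__getitem__)
print(top_head, [round(s, 4) for s in scores])
```

In this toy setup, the head whose removal shifts the output distribution the most receives the highest score. A dataset-level variant would aggregate such scores over many harmful queries, but how the paper performs that aggregation is not stated in this summary.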
