Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors

Weixuan Wang, Jingyuan Yang, Wei Peng
2024-10-16
Abstract: Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical method to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods exclusively employ a fixed steering vector to modify model activations, lacking adaptability to diverse input semantics. To address this limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time. More specifically, SADI utilizes activation differences in contrastive pairs to precisely identify critical elements of an LLM (i.e., attention heads, hidden states, and neurons) for targeted intervention. During inference, SADI dynamically steers model behavior by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. SADI's cost-effectiveness and generalizability across various LLM backbones and tasks highlight its potential as a versatile alignment technique. In addition, we release the code to foster research along this line: https://github.com/weixuan-wang123/SADI.
Computation and Language
What problem does this paper attempt to address?
This paper asks how large language models (LLMs) can adjust their behavior to the semantic context of each input while maintaining their performance. Existing activation intervention methods usually modify the model's activations with a fixed steering vector, which adapts poorly to diverse inputs and can degrade model performance. The authors therefore propose a new method, Semantics-Adaptive Dynamic Intervention (SADI), which aims to adjust LLM behavior more precisely by dynamically generating steering vectors tied to the input semantics.

### Specific Problem Description

1. **Alignment Problem**:
   - Although large language models perform well on many tasks, aligning their behavior with the intended goals remains challenging.
   - Existing methods such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and prompt engineering are effective but have limitations: they can require large amounts of data, and they struggle to prevent hallucination.
2. **Limitations of Existing Activation Intervention Methods**:
   - Current activation intervention methods modify the model's activations with a fixed steering vector, which does not adapt to different input semantics.
   - A fixed steering vector may be poorly aligned with the semantic direction of a given input, which lowers prediction performance, especially when inputs differ widely in semantics.

### Proposed Solution

To address these problems, the authors propose SADI, which proceeds in three steps:

1. **Difference Extraction**:
   - Extract the per-layer activation differences from contrastive pairs (positive and negative samples) to identify the key elements that affect model behavior.
   - For each instance \(i\) and each layer \(l\), compute the activation difference between the positive and negative samples: \(D^{(l)}_i=A^{(l)}_{\text{pos}, i}-A^{(l)}_{\text{neg}, i}\).
2. **Binary Masking**:
   - Average the differences over all instances and concatenate the per-layer averages into an overall mean difference vector \(D\).
   - Create an identification mask \(M\) by binarizing the mean difference, retaining only the elements that have a significant impact on model behavior.
3. **Adaptive Steering**:
   - At inference time, apply the identification mask to the activations of the user input and scale them element-wise along the input's own semantic direction.
   - The update rule for the dynamic steering vector is \(A'_q = A_q+\delta(A_q\odot M)\), where \(\delta\) is a hyperparameter that controls the intervention strength.

### Experimental Results

The experimental results show that SADI significantly outperforms existing activation intervention methods on several multiple-choice and open-ended generation tasks, especially on tasks with scarce data. SADI also generalizes well across LLMs of different scales and to multilingual scenarios.

In summary, SADI addresses the lack of adaptability of fixed steering vectors by dynamically generating steering vectors tied to the input semantics, which yields more precise and efficient adjustment of model behavior.
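
The three steps of the proposed solution (difference extraction, binary masking, adaptive steering) can be sketched on plain NumPy arrays. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, and the top-k rule used to binarize the mean difference are all hypothetical choices, since the summary says only that the mask keeps elements with a significant impact. A real intervention would read and write the activations of an LLM's attention heads, hidden states, or neurons rather than standalone arrays.

```python
import numpy as np


def mean_difference(pos_acts, neg_acts):
    """Step 1: per-instance, per-layer differences D_i^(l) = A_pos - A_neg,
    averaged over instances and concatenated across layers.

    pos_acts, neg_acts: arrays of shape (num_pairs, num_layers, dim)
    (an assumed layout). Returns a vector of shape (num_layers * dim,).
    """
    diffs = pos_acts - neg_acts
    return diffs.mean(axis=0).reshape(-1)


def binary_mask(mean_diff, top_k):
    """Step 2: binarize the mean difference into a mask M.

    Assumption: keep the top_k entries with the largest absolute mean
    difference; the exact selection criterion is not given in the summary.
    """
    mask = np.zeros_like(mean_diff)
    top_idx = np.argsort(np.abs(mean_diff))[-top_k:]
    mask[top_idx] = 1.0
    return mask


def steer(query_acts, mask, delta):
    """Step 3: A'_q = A_q + delta * (A_q ⊙ M).

    The steering term is built from the query's own activations, so it
    changes with the input instead of being a fixed vector.
    """
    return query_acts + delta * (query_acts * mask)
```

Because the added term is `delta * (query_acts * mask)`, masked-out elements pass through unchanged, while selected elements are scaled by `1 + delta` in whatever direction the input's own activation already points.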