Abstract: Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical method to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods exclusively employ a fixed steering vector to modify model activations, lacking adaptability to diverse input semantics. To address this limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time. More specifically, SADI utilizes activation differences in contrastive pairs to precisely identify critical elements of an LLM (i.e., attention heads, hidden states, and neurons) for targeted intervention. During inference, SADI dynamically steers model behavior by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. SADI's cost-effectiveness and generalizability across various LLM backbones and tasks highlight its potential as a versatile alignment technique. In addition, we release the code to foster research along this line: https://github.com/weixuan-wang123/SADI.

What problem does this paper attempt to address?

This paper asks how large language models (LLMs) can adjust their behavior to the semantic context of each input while keeping their strong task performance. Existing activation intervention methods usually modify the model's activations with a fixed steering vector, which adapts poorly to diverse inputs and can degrade performance. The authors therefore propose Semantics-Adaptive Dynamic Intervention (SADI), which steers LLM behavior more precisely by generating steering vectors dynamically from the input semantics.

### Specific Problem Description

1. **Alignment Problem**:
   - Although large language models perform well on many tasks, aligning their behavior with the intended goals remains challenging.
   - Existing methods such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and prompt engineering are effective but have limitations: they require large amounts of data and struggle to prevent hallucination.
2. **Limitations of Existing Activation Intervention Methods**:
   - Current activation intervention methods modify the model's activations with a fixed steering vector, which cannot adapt to different input semantics.
   - A fixed steering vector may be poorly aligned with the semantic direction of the input, which lowers prediction performance, especially when inputs differ widely in meaning.

### Proposed Solution

To address these problems, the authors propose SADI, which proceeds in three steps:

1. **Difference Extraction**:
   - Extract the per-layer activation differences from contrastive pairs (positive and negative samples) to identify the key elements that affect model behavior.
   - For each instance \(i\) and each layer \(l\), calculate the activation difference between the positive and negative samples: \(D^{(l)}_i = A^{(l)}_{\text{pos}, i} - A^{(l)}_{\text{neg}, i}\).
2. **Binary Masking**:
   - Average the differences across instances and concatenate the results across layers into an overall mean-difference vector \(D\).
   - Create an identification mask \(M\) by binarizing \(D\), retaining only the elements that have a significant impact on model behavior.
3. **Adaptive Steering**:
   - During inference, apply the identification mask to the activations of the user input and scale them element-wise according to the semantic direction of the input.
   - The update formula for the dynamic steering vector is \(A'_q = A_q + \delta (A_q \odot M)\), where \(\delta\) is a hyperparameter that controls the intervention strength.

### Experimental Results

The experiments show that SADI significantly outperforms existing activation intervention methods on several multiple-choice tasks and open-ended generation tasks, especially on tasks with scarce data. SADI also generalizes well: it works with LLMs of different scales and in multilingual scenarios.

In summary, SADI addresses the lack of adaptability of fixed steering vectors by generating steering vectors dynamically from the input semantics, which yields more precise and efficient adjustment of model behavior.
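The three steps under Proposed Solution (difference extraction, binary masking, adaptive steering) can be illustrated with a minimal, dependency-free sketch. It is not the authors' released implementation. The function names, the flat list representation of activations (all layers assumed to be already concatenated per instance), and the use of a top-k cut on absolute mean difference to binarize \(D\) are assumptions made for illustration; the summary only states that the mask keeps the elements with a significant impact.

```python
import random


def mean_difference(pos_acts, neg_acts):
    """Average D_i = A_pos,i - A_neg,i over instances.

    pos_acts, neg_acts: lists of equal-length flat activation vectors,
    one per instance (assumed: layers already concatenated).
    """
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(pos_acts[i][j] - neg_acts[i][j] for i in range(n)) / n
        for j in range(dim)
    ]


def binary_mask(mean_diff, top_k):
    """Return a 0/1 mask keeping the top_k entries by absolute mean difference.

    The top-k rule is an assumed binarization criterion.
    """
    ranked = sorted(
        range(len(mean_diff)), key=lambda j: abs(mean_diff[j]), reverse=True
    )
    keep = set(ranked[:top_k])
    return [1.0 if j in keep else 0.0 for j in range(len(mean_diff))]


def adaptive_steer(query_acts, mask, delta):
    """Apply A'_q = A_q + delta * (A_q * M), with * taken element-wise."""
    return [a + delta * (a * m) for a, m in zip(query_acts, mask)]


# Toy data standing in for real model activations.
random.seed(0)
dim = 12
pos = [[random.gauss(0.5, 1.0) for _ in range(dim)] for _ in range(6)]
neg = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]

D = mean_difference(pos, neg)
M = binary_mask(D, top_k=4)
A_q = [random.gauss(0.0, 1.0) for _ in range(dim)]
A_q_steered = adaptive_steer(A_q, M, delta=2.0)
```

Because the steering term is \(\delta (A_q \odot M)\), each masked element is scaled by \(1 + \delta\) in the direction of the input's own activation, while unmasked elements pass through unchanged. This is what distinguishes the update from adding a fixed vector that is the same for every input.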

Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors

CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Improving Activation Steering in Language Models with Mean-Centring

Steering Llama 2 via Contrastive Activation Addition

Steering Language Models With Activation Engineering

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

Experimental Design for Active Transductive Inference in Large Language Models

Activation Scaling for Steering and Interpreting Language Models

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Improving Instruction-Following in Language Models through Activation Steering

Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization

Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering

Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs

Spectral Editing of Activations for Large Language Model Alignment

Towards Inference-time Category-wise Safety Steering for Large Language Models

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories

PILL: Plug Into LLM with Adapter Expert and Attention Gate

Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering

LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts