Can sparse autoencoders be used to decompose and interpret steering vectors?

Harry Mayne, Yushi Yang, Adam Mahdi

2024-11-14

Abstract: Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper asks whether sparse autoencoders (SAEs) can be used to decompose and interpret steering vectors. Steering vectors are a promising method for controlling the behavior of large language models, but their underlying mechanisms are not yet well understood. Although SAEs may provide a way to interpret steering vectors, recent research has found that SAE-reconstructed vectors often lack the steering properties of the original vectors. The paper therefore investigates why directly applying SAEs to steering vectors leads to misleading decompositions, and identifies two main reasons:

1. **Steering vectors fall outside the input distribution for which SAEs are designed**: The L2 norm of a steering vector is significantly smaller than the L2 norm of the model activations an SAE is designed to decompose. As a result, the SAE's encoder bias term has a disproportionate impact on the decomposition and distorts the result.
2. **SAEs restrict decomposition to non-negative reconstruction coefficients**: Steering vectors may have meaningful negative projections in certain feature directions, but SAEs are not designed to produce negative reconstruction coefficients, so these negative projections are not captured.

Together, these two problems limit the effectiveness of directly using SAEs to interpret steering vectors. The paper supports these claims with theoretical analysis and experimental evidence, and proposes directions for future research to overcome them.
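The two failure modes can be illustrated with a toy example. The sketch below is not the paper's code: it assumes a simplified SAE encoder of the form ReLU(W_enc x + b_enc), and the weights, bias values, and vectors (`W_enc`, `b_enc`, `activation`, `small_steer`, `neg_steer`) are hypothetical numbers chosen by hand so that the effect is easy to see.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Simplified SAE encoder (an assumption): ReLU(W_enc @ x + b_enc)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

# Hypothetical toy SAE: 3 features over a 4-dimensional residual stream.
# Feature i simply reads coordinate i of the input.
W_enc = np.eye(3, 4)
b_enc = np.array([0.5, -0.2, 0.0])  # hand-picked encoder bias

# (1) Norm mismatch. A typical activation has a large norm, so the
# projection W_enc @ x dominates the bias.
activation = np.array([10.0, 0.0, 0.0, 0.0])
act_features = sae_encode(activation, W_enc, b_enc)

# A steering vector with a much smaller norm: the bias now dominates.
# The vector points along feature 1, yet feature 0 fires (from bias alone)
# and feature 1 is suppressed by its negative bias.
small_steer = np.array([0.0, 0.1, 0.0, 0.0])
small_features = sae_encode(small_steer, W_enc, b_enc)

# (2) Negative projections. This vector points strongly *against* feature 2,
# but the ReLU clips the coefficient to zero, so the decomposition cannot
# express "minus feature 2".
neg_steer = np.array([0.0, 0.0, -3.0, 0.0])
neg_features = sae_encode(neg_steer, W_enc, b_enc)

print("activation  ->", act_features)
print("small steer ->", small_features)
print("neg steer   ->", neg_features)
```

In this toy setup, the small-norm vector is decomposed almost entirely according to the bias rather than its own direction, and the negative projection on feature 2 disappears from the feature activations. Real SAEs differ in detail (for example, some subtract a decoder bias before encoding), so this is only meant to convey the shape of the argument.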

Improving Steering Vectors by Targeting Sparse Autoencoder Features

Analyzing the Generalization and Reliability of Steering Vectors

Decomposing The Dark Matter of Sparse Autoencoders

Disentangling Dense Embeddings with Sparse Autoencoders

Can sparse autoencoders make sense of latent representations?

Sequential sparse autoencoder for dynamic heading representation in ventral intraparietal area

Analyzing (In)Abilities of SAEs via Formal Languages

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Neural Steerer: Novel Steering Vector Synthesis with a Causal Neural Field over Frequency and Source Positions

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Steering Language Model Refusal with Sparse Autoencoders

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Interpreting Attention Layer Outputs with Sparse Autoencoders

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering

Interpret the Internal States of Recommendation Model with Sparse Autoencoder

Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents

Rethinking Controllable Variational Autoencoders

Decoder Decomposition for the Analysis of the Latent Space of Nonlinear Autoencoders With Wind-Tunnel Experimental Data