Studying Large Language Model Generalization with Influence Functions

Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman
2023-08-07
Abstract: When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to better understand and evaluate the behavior and risks of large language models (LLMs). Specifically, the authors ask which training examples contribute most to a given model behavior, and use the answer to study how the model generalizes in different scenarios. Influence functions approach this through a counterfactual question: if a given sequence were added to the training set, how would the model's parameters (and hence its outputs) change?

### Main problems

1. **Understanding model behavior**: Which training examples have the greatest impact on a specific behavior of the model?
2. **Evaluating risks**: How can this information be used to identify and mitigate model-related risks such as social bias, privacy leaks, and the spread of misinformation?
3. **Generalization ability**: What generalization patterns do LLMs exhibit across tasks, including sparsity of influence, degree of abstraction, math and programming abilities, cross-lingual generalization, and role-playing behavior?

### Solutions

To address these problems, the authors propose the following methods:

1. **Scaling influence functions**: Use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions to LLMs with up to 52 billion parameters.
2. **Optimizing computation**: Introduce TF-IDF filtering and query batching to reduce the cost of computing gradients for candidate training sequences.
3. **Layer and token attribution**: Compute not only the overall influence but also its breakdown by network layer and by token, to understand in finer detail where knowledge is stored and how the model generalizes.

### Experimental results

1. **Accuracy verification**: EK-FAC is comparable to the traditional LiSSA algorithm in the accuracy of its influence estimates, while the IHVP computation is orders of magnitude faster.
2. **Influence distribution**: The influence distribution is heavy-tailed and roughly follows a power law, indicating that the model's behavior is not simply memorized from a handful of training examples.
3. **Generalization levels**: Larger models usually generalize at a higher level of abstraction, as seen in role-playing, programming, mathematical reasoning, and cross-lingual examples.
4. **Distribution across layers**: Influence is spread roughly evenly across layers, but different layers exhibit different generalization patterns, with the middle layers more focused on abstract patterns.
5. **Word-order sensitivity**: Despite these apparently sophisticated generalization patterns, influence is highly sensitive to word order: it decays to near zero when the order of key phrases is flipped.
6. **Role-playing behavior**: Role-playing behavior is mainly influenced by examples or descriptions of similar behavior in the training set, suggesting that it is driven more by imitation than by complex planning.

Through these methods and experiments, the authors provide a powerful new tool for studying the generalization properties of LLMs, along with insights that can inform risk management and model improvement.
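
To make the counterfactual concrete, the following is a minimal sketch of the classical influence-function estimate on a toy problem. It is not the paper's EK-FAC pipeline: it assumes a small ridge-regularized linear regression so that the Hessian can be formed and inverted exactly, and it measures influence on a query example's loss rather than on a completion's log-likelihood. All names (`fit`, `example_grad`, `influence`, the synthetic data, and the damping value `lam`) are hypothetical and chosen for illustration. The estimate `-grad_query^T H^{-1} grad_candidate` is compared against the actual effect of slightly up-weighting the candidate example and refitting.

```python
import numpy as np

def fit(X, y, lam, extra=None, eps=0.0):
    """Minimize mean 0.5*(x@theta - y)^2 + (lam/2)*||theta||^2 in closed form.
    If `extra` is given, its loss is added to the objective with weight eps."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    if extra is not None:
        x_m, y_m = extra
        A = A + eps * np.outer(x_m, x_m)
        b = b + eps * y_m * x_m
    return np.linalg.solve(A, b)

def example_loss(theta, x, y):
    """Squared-error loss of a single example."""
    return 0.5 * (x @ theta - y) ** 2

def example_grad(theta, x, y):
    """Gradient of example_loss with respect to theta."""
    return (x @ theta - y) * x

def influence(X, y, lam, candidate, query):
    """First-order estimate of d(query loss)/d(eps) when the candidate
    example is up-weighted by eps: -grad_query^T H^{-1} grad_candidate."""
    theta = fit(X, y, lam)
    hessian = X.T @ X / X.shape[0] + lam * np.eye(X.shape[1])
    # Inverse-Hessian-vector product; exact here, approximated in large models.
    ihvp = np.linalg.solve(hessian, example_grad(theta, *query))
    return -float(ihvp @ example_grad(theta, *candidate))

# Synthetic data (assumption: purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
candidate = (rng.normal(size=5), 1.0)   # hypothetical training example to add
query = (rng.normal(size=5), 0.5)       # hypothetical example whose loss we measure
lam = 1e-2

predicted = influence(X, y, lam, candidate, query)

# Ground truth: refit with the candidate up- and down-weighted by a small eps
# and take a central finite difference of the query loss.
eps = 1e-4
loss_plus = example_loss(fit(X, y, lam, candidate, +eps), *query)
loss_minus = example_loss(fit(X, y, lam, candidate, -eps), *query)
actual = (loss_plus - loss_minus) / (2 * eps)

print(f"influence estimate: {predicted:.6f}  refit slope: {actual:.6f}")
```

In this sketch a positive value means that up-weighting the candidate would increase the query loss. The single `np.linalg.solve` call against the Hessian is the inverse-Hessian-vector product; for a model with billions of parameters that exact solve is infeasible, which is the step the summary says EK-FAC approximates.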