Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models

Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
2024-06-07
Abstract: Understanding the internal representations of large language models (LLMs) can help explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and multihop reasoning error correction.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the problem of understanding the internal representations of large language models (LLMs). Specifically, the authors propose a new framework, Patchscopes, for inspecting and interpreting the hidden representations of LLMs. By leveraging the model's own generative capabilities, Patchscopes explains those representations in natural language, which helps interpret the model's behavior and verify its alignment with human values.

### Main Contributions

1. **Unifying Existing Methods**: The authors show that many existing interpretability methods (such as projection into the vocabulary space and interventions on the model's computation) can be seen as differently configured instances of Patchscopes.
2. **Overcoming Limitations of Existing Methods**: Patchscopes mitigates some shortcomings of existing methods, such as their failure to inspect early-layer representations effectively or their lack of expressive power.
3. **Introducing New Possibilities**: Beyond unifying existing techniques, Patchscopes opens up new research directions, such as using a more capable model to explain the representations of a smaller model and correcting multi-hop reasoning errors.

### Experimental Results

1. **Decoding Next Word Prediction**: The authors evaluated how well Patchscopes decodes next-word predictions on multiple LLMs. The results show that, from the 10th layer onward, the Token Identity Patchscope significantly outperforms the baseline methods across all models, with improvements of up to 98%.
2. **Extracting Specific Attributes**: The authors used Patchscopes to extract specific attributes (such as commonsense and factual knowledge) from hidden representations and compared the results with traditional linear probing. In 6 out of 12 tasks, Patchscopes significantly outperformed the baseline methods despite using no training data (p < 1e-5).
### Conclusion

The paper proposes Patchscopes, a general and modular framework for decoding information from the hidden representations of LLMs. By showing how Patchscopes extends and improves on existing methods, the authors demonstrate the framework's potential for making models more interpretable and more useful in practice.
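
To make the idea of "patching" a hidden representation concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a tiny hand-written stand-in model (`embed`, `layer`, `forward` are hypothetical helpers invented for illustration) and shows the basic operation: take the hidden vector at one position and layer of a source run, optionally transform it with a mapping `f`, and write it into a chosen position and layer of a target run before the remaining layers are computed. All parameter names are illustrative assumptions. In a real LLM, the final hidden state would then be decoded into text.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Hidden = List[Vector]  # one vector per token position


def embed(tokens: List[int], dim: int = 4) -> Hidden:
    """Deterministic toy embedding (hypothetical stand-in for a real embedding table)."""
    return [[math.sin((t + 1) * (k + 1)) for k in range(dim)] for t in tokens]


def layer(h: Hidden, layer_idx: int) -> Hidden:
    """Toy causal layer: each position mixes with the mean of itself and earlier positions."""
    dim = len(h[0])
    out = []
    for j in range(len(h)):
        prefix = h[: j + 1]
        mean = [sum(v[k] for v in prefix) / len(prefix) for k in range(dim)]
        out.append([math.tanh(h[j][k] + 0.5 * mean[k] + 0.1 * layer_idx) for k in range(dim)])
    return out


def forward(
    tokens: List[int],
    n_layers: int = 4,
    patch: Optional[Tuple[int, int, Vector]] = None,
) -> List[Hidden]:
    """Run the toy model and return the hidden states after every layer.

    states[0] holds the embeddings and states[l] the output of layer l.
    If patch = (layer, position, vector) is given, the hidden vector at that
    layer and position is overwritten before later layers are computed.
    """
    h = embed(tokens)
    states = []
    for l in range(n_layers + 1):
        if l > 0:
            h = layer(h, l)
        if patch is not None and patch[0] == l:
            _, pos, vec = patch
            h = [list(vec) if j == pos else v for j, v in enumerate(h)]
        states.append(h)
    return states


def patchscope(
    source: List[int], i: int, l: int,
    target: List[int], i_star: int, l_star: int,
    f: Callable[[Vector], Vector] = lambda v: v,
    n_layers: int = 4,
) -> Vector:
    """Patch f(source hidden state at (l, i)) into the target run at (l_star, i_star).

    Returns the final-layer state at the last target position, which is where a
    real language model would decode its next token from.
    """
    src_states = forward(source, n_layers)
    vec = f(src_states[l][i])
    tgt_states = forward(target, n_layers, patch=(l_star, i_star, vec))
    return tgt_states[-1][-1]


if __name__ == "__main__":
    # Inspect what source position 2, layer 2 contributes when read through a different prompt.
    print(patchscope([1, 2, 3], i=2, l=2, target=[4, 5, 6], i_star=1, l_star=1))
```

Under these assumptions, the different inspection methods the summary mentions would correspond to different choices of target prompt, layers, positions, and mapping `f`; for instance, patching into the same prompt at the same layer and position leaves the computation unchanged.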