Abstract: Understanding the internal representations of large language models (LLMs) can help explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and multihop reasoning error correction.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the problem of understanding the internal representations of large language models (LLMs). Specifically, the authors propose a new framework, Patchscopes, for interpreting and inspecting the hidden representations of LLMs. By leveraging the model's own generative capabilities, Patchscopes explains those internal representations in natural language, which helps interpret the model's behavior and verify its alignment with human values.

### Main Contributions

1. **Unifying Existing Methods**: The authors demonstrate that many existing interpretability methods (such as projection into the vocabulary space and intervention on the model's computation) can be seen as differently configured instances of Patchscopes.
2. **Overcoming Limitations of Existing Methods**: Patchscopes can mitigate some shortcomings of existing methods, such as the failure to inspect representations in early layers and the lack of expressivity.
3. **Introducing New Possibilities**: Beyond unifying existing techniques, Patchscopes opens up new research directions, such as using a more capable model to explain the representations of a smaller model and correcting multi-hop reasoning errors.

### Experimental Results

1. **Decoding Next-Token Prediction**: The authors evaluated Patchscopes on decoding next-token predictions across multiple LLMs. The results show that, from the 10th layer onward, the Token Identity Patchscope significantly outperforms the baseline methods across all models, with improvements of up to 98%.
2. **Extracting Specific Attributes**: The authors used Patchscopes to extract specific attributes (such as commonsense and factual knowledge) from hidden representations and compared the results with traditional linear probing. In 6 out of 12 tasks, Patchscopes significantly outperformed the baselines without using any training data (p<1e−5).

### Conclusion

The paper proposes Patchscopes, a general and modular framework for decoding information from the hidden representations of LLMs. By showing how Patchscopes extends and improves on existing methods, the authors demonstrate the framework's potential for enhancing model interpretability and practicality.
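To make the patching idea in the summary above more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the tiny random "model", the names `forward` and `patchscope`, and the parameters `i`, `l`, `i_star`, `l_star`, and `f` are all hypothetical and chosen only for illustration. The sketch assumes that inspecting a representation amounts to copying one hidden vector from a source run into a chosen layer and position of a separate target run, then reading the target run's output. Under that assumption, patching into the final layer reduces to a direct projection into the vocabulary space, which is one way to picture the claim that projection-based methods are instances of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, VOCAB = 8, 4, 16

# Hypothetical toy "model": random embeddings, one matrix per layer, an unembedding.
EMBED = rng.normal(size=(VOCAB, D))
LAYERS = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]
UNEMBED = rng.normal(size=(D, VOCAB))


def forward(tokens, patch=None):
    """Run the toy model and return (states, logits).

    states[l] is the hidden state entering layer l; states[N_LAYERS] is the
    final hidden state. patch = (layer, position, vector) overwrites one
    hidden vector before the run continues from that layer.
    """
    h = EMBED[np.array(tokens)]
    states = []
    for layer in range(N_LAYERS + 1):
        if patch is not None and patch[0] == layer:
            h = h.copy()
            h[patch[1]] = patch[2]
        states.append(h)
        if layer < N_LAYERS:
            # Causal mixing: each position sees the running mean of earlier ones.
            ctx = np.cumsum(h, axis=0) / np.arange(1, len(h) + 1)[:, None]
            h = np.tanh((h + ctx) @ LAYERS[layer])
    return states, states[-1] @ UNEMBED


def patchscope(source, i, l, target, i_star, l_star, f=lambda v: v):
    """Copy the source hidden vector at (l, i), map it with f, patch it into
    the target run at (l_star, i_star), and return the target's last-position
    logits."""
    src_states, _ = forward(source)
    vec = f(src_states[l][i])
    _, logits = forward(target, patch=(l_star, i_star, vec))
    return logits[-1]


source = [1, 2, 3]       # stand-in token ids for the prompt being inspected
target = [4, 5, 6, 7]    # stand-in token ids for the inspection prompt

# Patch a layer-1 representation from the source into layer 2 of the target.
patched_logits = patchscope(source, i=2, l=1, target=target, i_star=3, l_star=2)

# Special case: patching into the final layer at the last position is the same
# as projecting the source vector directly through the unembedding matrix.
projected = patchscope(source, i=2, l=1, target=target,
                       i_star=len(target) - 1, l_star=N_LAYERS)
```

In a real LLM the same pattern would be applied to transformer hidden states, typically through forward hooks, and the target prompt would be designed so that the generated text describes the patched representation. Those details depend on the model and are not shown here.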

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models

Towards Uncovering How Large Language Model Works: An Explainability Perspective

Understanding and Patching Compositional Reasoning in LLMs

PATCH! Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Proficiency in 8th Grade Mathematics

Patchfinder: Leveraging Visual Language Models for Accurate Information Retrieval using Model Uncertainty

Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph

Patch-CLIP: A Patch-Text Pre-Trained Model

From Understanding to Utilization: A Survey on Explainability for Large Language Models

Fixing Model Bugs with Natural Language Patches

Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models

Patched RTC: evaluating LLMs for diverse software development tasks

Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention

A Concept-Based Explainability Framework for Large Multimodal Models

Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries

Patch-Level Training for Large Language Models

From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Quantifying and Enabling the Interpretability of CLIP-like Models

Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation

Interpretability of Language Models via Task Spaces