Understanding Language Model Circuits through Knowledge Editing

Huaizhi Ge, Frank Rudzicz, Zining Zhu
2024-12-17
Abstract: Recent advances in language model interpretability have identified circuits, critical subnetworks that replicate model behaviors, yet how knowledge is structured within these crucial subnetworks remains opaque. To understand the knowledge in these circuits, we conduct systematic knowledge editing experiments on the circuits of the GPT-2 language model. Our analysis reveals intriguing patterns in how circuits respond to editing attempts, the extent of knowledge distribution across network components, and the architectural composition of knowledge-bearing circuits. These findings offer insights into the complex relationship between model circuits and knowledge representation, deepening the understanding of how information is organized within language models. Our findings offer novel insights into the "meanings" of the circuits, and introduce directions for further interpretability and safety research of language models.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the question of how knowledge is structured inside the "circuits" of large language models (LLMs). Specifically, the authors examine how knowledge is organized and distributed in these key subnetworks (i.e., circuits) and what role the subnetworks play in the model.

### Main problems:

1. **Understanding of knowledge structure**: Recent research has identified key subnetworks (circuits) that can replicate model behaviors, but how knowledge is organized inside these subnetworks remains an open question.
2. **Impact of knowledge editing**: By systematically running knowledge-editing experiments on the circuits of the GPT-2 model, the authors aim to reveal how these circuits respond to editing attempts, how widely knowledge is distributed across network components, and the architectural composition of knowledge-bearing circuits.
3. **Relationship between circuits and knowledge representation**: The authors use these findings to clarify the complex relationship between circuits and knowledge representation, and thereby how information is organized inside the language model.

### Solutions:

To answer these questions, the authors take the following approach:

1. **Circuit extraction**: Use differentiable masking to extract circuits from the GPT-2 model, ensuring that each circuit matches the full model's task performance.
2. **Knowledge-editing experiments**: Apply knowledge edits to circuits to evaluate how different circuit components respond to knowledge modification. For example, change "A cat is an animal" to "A cat is a plant" and observe how the model's behavior changes.
3. **Circuit analysis**: Analyze circuits of different sizes (from 50% down to 5% of the model parameters) to understand how knowledge is distributed. The study also examines circuit overlap across datasets to reveal patterns of knowledge organization.
4. **Architectural composition analysis**: Study the parameter ratios of each component type in the circuit (such as LayerNorm, attention, and multi-layer perceptron (MLP) layers) to understand their roles in knowledge storage.

### Key findings:

1. **Confirmation-bias behavior**: Knowledge-intensive circuits show stronger resistance to editing, suggesting that information is stored in a structured pattern.
2. **Knowledge distribution**: Ideal knowledge-bearing circuits are neither highly concentrated nor widely dispersed, but somewhere in between.
3. **Circuit overlap**: Circuits for different tasks overlap significantly, especially for tasks involving hierarchical relationships and grammatical knowledge.
4. **Importance of LayerNorm**: LayerNorm components make up a large proportion of the circuit and may play an important role in maintaining network stability and organizing information.

With these findings, the authors point to new directions for interpretability and safety research on language models and offer suggestions for future research.
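
The circuit-extraction step mentioned above relies on differentiable masking, which the summary does not describe in detail. The sketch below illustrates the general idea on a toy linear model, not on GPT-2 and not with the authors' actual procedure: each frozen weight gets a learnable logit, the soft mask `sigmoid(logit)` is trained so the masked model matches the full model while a sparsity penalty pushes masks toward zero, and the final circuit keeps the weights whose mask exceeds 0.5. The function name `learn_mask`, the toy weights, and all hyperparameters are assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def learn_mask(weights, inputs, sparsity=0.05, lr=1.0, steps=2000, init_logit=2.0):
    """Learn one soft mask per frozen weight of a toy linear model (hypothetical helper).

    Loss = mean squared gap between masked and full outputs
           + sparsity * sum(mask).
    Only the mask logits are updated; the weights stay frozen.
    """
    logits = [init_logit] * len(weights)
    n = len(inputs)
    for _ in range(steps):
        mask = [sigmoid(s) for s in logits]
        # Gradient of the sparsity penalty with respect to each mask value.
        grad_mask = [sparsity] * len(weights)
        for x in inputs:
            full = sum(w * xi for w, xi in zip(weights, x))
            masked = sum(m * w * xi for m, w, xi in zip(mask, weights, x))
            err = masked - full
            # Gradient of the faithfulness term with respect to each mask value.
            for i, (w, xi) in enumerate(zip(weights, x)):
                grad_mask[i] += 2.0 * err * w * xi / n
        # Chain rule through the sigmoid, then a plain gradient-descent step.
        logits = [s - lr * g * m * (1.0 - m)
                  for s, g, m in zip(logits, grad_mask, mask)]
    return [sigmoid(s) for s in logits]


# Toy frozen model: two weights matter, two contribute (almost) nothing.
weights = [2.0, 0.0, -1.5, 0.01]
# One-hot probe inputs, so each weight is exercised independently.
inputs = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

mask = learn_mask(weights, inputs)
circuit = [m > 0.5 for m in mask]  # binarize the soft mask into a circuit
print(circuit)
```

In this toy setting, raising `sparsity` shrinks the circuit, which loosely mirrors the idea of studying circuits of different sizes; how the paper actually controls circuit size is not stated in the summary.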
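
The circuit-overlap and architectural-composition analyses described above can be illustrated with a small sketch. The summary does not say which metrics the authors use, so this example assumes two simple ones: intersection-over-union between circuits, and the fraction of each component type's parameters that a circuit retains. A circuit is represented here as a set of `(parameter_name, flat_index)` pairs; the parameter names only mimic GPT-2-style naming, and the toy circuits and sizes are invented for illustration.

```python
from collections import Counter


def circuit_overlap(circuit_a: set, circuit_b: set) -> float:
    """Intersection-over-union of two circuits (assumed metric, not from the paper)."""
    union = circuit_a | circuit_b
    if not union:
        return 0.0
    return len(circuit_a & circuit_b) / len(union)


def component_type(param_name: str) -> str:
    """Map a GPT-2-style parameter name to a coarse component type."""
    if ".ln_" in param_name:
        return "layernorm"
    if ".attn." in param_name:
        return "attention"
    if ".mlp." in param_name:
        return "mlp"
    return "other"


def retention_by_component(circuit: set, totals: dict) -> dict:
    """Fraction of each component type's parameters that the circuit keeps."""
    kept = Counter(component_type(name) for name, _ in circuit)
    size = Counter()
    for name, count in totals.items():
        size[component_type(name)] += count
    return {kind: kept[kind] / size[kind] for kind in size}


# Hypothetical parameter counts for one tiny transformer block.
totals = {
    "h.0.ln_1.weight": 4,
    "h.0.attn.c_attn.weight": 16,
    "h.0.mlp.c_fc.weight": 16,
}

# Two made-up circuits, e.g. one per dataset.
circuit_hierarchy = {
    ("h.0.ln_1.weight", 0), ("h.0.ln_1.weight", 1), ("h.0.ln_1.weight", 2),
    ("h.0.attn.c_attn.weight", 3), ("h.0.mlp.c_fc.weight", 5),
}
circuit_syntax = {
    ("h.0.ln_1.weight", 0), ("h.0.ln_1.weight", 1),
    ("h.0.attn.c_attn.weight", 3), ("h.0.attn.c_attn.weight", 7),
}

overlap = circuit_overlap(circuit_hierarchy, circuit_syntax)
ratios = retention_by_component(circuit_hierarchy, totals)
print(overlap, ratios)
```

Normalizing by component size matters in a sketch like this: LayerNorm has far fewer parameters than attention or MLP layers, so raw counts alone would hide a high retention rate of the kind the fourth finding describes.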