Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis

Zeping Yu, Sophia Ananiadou
2024-09-21
Abstract: We find arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations. To delve into the reason, we introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction: feature enhancing with shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Moreover, we identify the human-interpretable FFN neurons within both feature-enhancing and feature-predicting stages. These findings lead us to investigate the mechanism of LoRA, revealing that it enhances prediction probabilities by amplifying the coefficient scores of FFN neurons related to predictions. Finally, we apply our method in model pruning for arithmetic tasks and model editing for reducing gender bias. Code is available at https://github.com/zepingyu0512/arithmetic-mechanism.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to explain the mechanisms that large language models (LLMs) use in arithmetic tasks. Specifically:

1. **Identifying Key Parameters**: The authors find that only a few attention heads significantly affect arithmetic performance. They then examine how these heads work and how they influence the feed-forward network (FFN) layers.
2. **Constructing Internal Logical Chains**: The paper proposes the Comparative Neuron Analysis (CNA) method, which identifies four stages from input to prediction: feature enhancing, feature transferring, feature predicting, and prediction enhancing.
3. **Explaining the LoRA Mechanism**: Using the CNA method, the paper examines how LoRA works and finds that it raises prediction probability by amplifying the coefficient scores of FFN neurons related to the final prediction.
4. **Application and Optimization**: Based on these findings, the paper designs a model pruning method for arithmetic tasks and a model editing method to reduce gender bias.

Overall, the paper uses experiments and analysis to reveal the internal mechanisms by which large language models handle arithmetic tasks, and applies those findings to model pruning and to model editing for reducing gender bias.
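
The claim that amplifying the coefficient scores of prediction-related FFN neurons raises the prediction probability can be illustrated with a toy example. The sketch below is not the paper's code. It assumes the common view of an FFN layer's output as a sum of per-neuron value vectors, each weighted by a scalar coefficient, and it scores each neuron by how much it alone changes the log probability of the target token. All names (`coeffs`, `values`, `unembed`, `neuron_logprob_gain`), the hand-picked numbers, and the amplification factor of 2 are hypothetical. No LoRA training is performed; the factor only stands in for the effect the summary attributes to LoRA.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ffn_output(coeffs, values):
    """FFN output as a coefficient-weighted sum of per-neuron value vectors."""
    return coeffs @ values

def target_prob(residual, coeffs, values, unembed, target):
    """Probability of the target token after adding the FFN output to the residual."""
    logits = unembed @ (residual + ffn_output(coeffs, values))
    return softmax(logits)[target]

def neuron_logprob_gain(residual, coeffs, values, unembed, target):
    """For each neuron, the change in log p(target) when only that neuron is added."""
    base = np.log(softmax(unembed @ residual)[target])
    gains = []
    for m, v in zip(coeffs, values):
        p = softmax(unembed @ (residual + m * v))[target]
        gains.append(np.log(p) - base)
    return np.array(gains)

# Toy setup (assumed values): 3-token vocabulary, hidden size 3, 4 FFN neurons.
unembed = np.eye(3)                      # identity unembedding keeps logits readable
residual = np.zeros(3)                   # residual stream before the FFN layer
values = np.array([[2.0, 0.0, 0.0],      # neuron 0 pushes toward token 0
                   [0.0, 1.0, 0.0],      # neuron 1 pushes toward token 1
                   [0.0, 0.0, 1.0],      # neuron 2 pushes toward token 2
                   [-1.0, 0.0, 0.0]])    # neuron 3 pushes away from token 0
coeffs = np.array([0.5, 1.0, 1.0, 0.2])  # per-neuron coefficient scores
target = 0                               # index of the correct token

gains = neuron_logprob_gain(residual, coeffs, values, unembed, target)
p_before = target_prob(residual, coeffs, values, unembed, target)

# Amplify only the neurons that help the target (stand-in for the LoRA effect).
amplified = np.where(gains > 0, 2.0 * coeffs, coeffs)
p_after = target_prob(residual, amplified, values, unembed, target)

print("per-neuron log-prob gain:", np.round(gains, 3))
print("p(target) before:", round(float(p_before), 4))
print("p(target) after: ", round(float(p_after), 4))
```

In this toy setup only neuron 0 has a positive gain, so doubling its coefficient is enough to raise the target probability. The same per-neuron scores could in principle guide pruning (drop neurons with negligible gain) or editing (suppress neurons tied to an unwanted prediction), though the paper's actual procedures may differ.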