From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Xuansheng Wu, Wenlin Yao, Jianshu Chen, Xiaoman Pan, Xiaoyang Wang, Ninghao Liu, Dong Yu
2024-04-05
Abstract: Large Language Models (LLMs) have achieved remarkable success, where instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how the instruction tuning adjusts pre-trained models with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution, and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective of the model shifts on a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts, and promotes the response generation constantly conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper investigates how instruction tuning changes pre-trained large language models (LLMs), with the goal of understanding the shift in their internal mechanisms. It focuses on two aspects:

1. **How instruction tuning enables the model to recognize the instruction part of a user prompt and keep its response conditioned on that instruction**:
   - The paper develops a set of interpretation methods: a gradient-based method that attributes output tokens to input tokens, and techniques for interpreting patterns and concepts in self-attention layers and feed-forward networks.
   - It compares the explanations obtained from pre-trained models with those from instruction-tuned models, giving a human-comprehensible view of the internal changes.
2. **The specific impact of instruction tuning on self-attention heads and feed-forward networks**:
   - How self-attention heads come to capture more word-to-word relationships involving instruction verbs.
   - How feed-forward networks rotate pre-trained knowledge toward user-oriented tasks without changing its linguistic structure.

### Main Findings

1. **Instruction tuning enables the model to recognize the instruction part of a user prompt and keep its response conditioned on that instruction**:
   - The paper extends traditional gradient-based attribution with normalization strategies. With this method, instruction words (such as "correct grammar:") are found to influence many response words at different positions, while the influence of other prompt words is limited.
   - A density function aggregates the overall importance of each prompt word. These importance density scores are reported to correlate strongly with the model's ability to follow instructions.
2. **Instruction tuning encourages self-attention heads to learn more word-to-word relationships involving instruction verbs**:
   - The paper proposes extracting word-to-word patterns under a local co-occurrence assumption, which mitigates the ambiguity that makes self-attention heads hard to interpret.
   - After instruction tuning, the word-to-word patterns within the same self-attention head change significantly. The change is most visible in the lower and middle layers, where patterns involving instruction verbs become more prevalent.
3. **Instruction tuning adapts the pre-trained knowledge in feed-forward networks to user-oriented tasks without changing its linguistic structure**:
   - The paper proposes interpreting the principal components of the weight vectors, which yields a "concept"-level interpretation.
   - An analysis of how these concepts are distributed over user-oriented tasks and over linguistic levels shows that the proportion of concepts suited to specific tasks increases significantly, while the distribution across linguistic levels remains unchanged.

### Conclusion

This study reveals the key role that instruction words play in instruction-tuned models and highlights the distinct contributions of self-attention mechanisms and feed-forward networks. These findings help explain the internal mechanisms of instruction tuning and provide a foundation for future work on optimizing and interpreting LLMs.
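To make the "concept"-level interpretation of feed-forward networks (finding 3) more concrete, the following is a minimal Python sketch of one way principal components of weight vectors could be turned into readable word lists. It is an illustration under stated assumptions, not the authors' implementation: the function name `principal_concepts`, the use of an SVD on mean-centered weights, the sign convention, and the step of scoring vocabulary embeddings against each direction are all assumptions introduced here, and the toy matrices are made up.

```python
import numpy as np


def principal_concepts(weights, embeddings, vocab, n_components=2, top_k=3):
    """Describe each principal direction of `weights` by its best-aligned tokens.

    weights:    (n_neurons, d_model) array, one feed-forward weight vector per row.
    embeddings: (vocab_size, d_model) array of token embeddings (assumed given).
    vocab:      list of vocab_size token strings.
    """
    # PCA via SVD: center the rows, then the rows of `vt` are the principal
    # directions in model space, ordered by explained variance.
    centered = weights - weights.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    concepts = []
    for direction in vt[:n_components]:
        # The sign of a principal direction is arbitrary. As a convention
        # (an assumption of this sketch), make its largest-magnitude entry positive.
        pivot = np.argmax(np.abs(direction))
        if direction[pivot] < 0:
            direction = -direction
        # Score every token by its dot product with the direction and keep the top ones.
        scores = embeddings @ direction
        top = np.argsort(-scores)[:top_k]
        concepts.append([vocab[i] for i in top])
    return concepts


if __name__ == "__main__":
    # Toy data: four weight vectors that vary mostly along the first axis.
    toy_weights = np.array([
        [3.0, 0.0, 0.0, 0.0],
        [-3.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, -1.0, 0.0, 0.0],
    ])
    toy_vocab = ["translate", "summarize", "the", "of"]
    toy_embeddings = np.eye(4)  # each token sits on its own axis
    print(principal_concepts(toy_weights, toy_embeddings, toy_vocab, top_k=1))
```

Running the same procedure on the weights of a pre-trained model and of its instruction-tuned counterpart, then comparing the resulting word lists, mirrors the comparison the summary describes. How the word lists are labeled as task-related or linguistic concepts is not specified here and is left out of the sketch.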