Interpreting and Exploiting Functional Specialization in Multi-Head Attention under Multi-task Learning

Chong Li, Shaonan Wang, Yunhao Zhang, Jiajun Zhang, Chengqing Zong
2023-10-16
Abstract: Transformer-based models, even though they achieve super-human performance on several downstream tasks, are often regarded as a black box and used as a whole. It is still unclear what mechanisms they have learned, especially in their core module: multi-head attention. Inspired by functional specialization in the human brain, which helps to efficiently handle multiple tasks, this work attempts to figure out whether the multi-head attention module will evolve a similar separation of functions under multi-task training. If so, can this mechanism further improve model performance? To investigate these questions, we introduce an interpreting method to quantify the degree of functional specialization in multi-head attention. We further propose a simple multi-task training method to increase functional specialization and mitigate negative information transfer in multi-task learning. Experimental results on seven pre-trained transformer models demonstrate that multi-head attention does evolve functional specialization after multi-task training, and that this specialization is affected by the similarity of tasks. Moreover, the multi-task training strategy based on functional specialization boosts performance in both multi-task learning and transfer learning without adding any parameters.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two questions:

1. **Does multi-head attention develop functional specialization under multi-task training?** Inspired by functional specialization in the human brain, the authors examine whether a similar separation of functions emerges in the multi-head attention module during multi-task training and, if it does, which factors affect its degree.
2. **How can functional specialization be used to improve model performance?** If multi-head attention does develop functional specialization, how can it be exploited to further improve performance in multi-task learning and transfer learning?

To study these questions, the authors propose two methods:

- **Interpretation method (IAP)**: Important Attention-head Pruning (IAP) quantifies the degree of functional specialization in multi-head attention. It computes an importance score for each attention head on each task, then prunes the most important heads for each task and measures the effect on task performance.
- **Utilization method (IAT)**: Important Attention-head Training (IAT) is a multi-task learning method that trains only the most important attention heads for each task. This promotes functional separation within the multi-head attention module, which mitigates negative information transfer in multi-task learning and improves model performance.

The experimental results show that multi-head attention does develop functional specialization after multi-task training, and that the degree of specialization is affected by task similarity. In addition, the multi-task training strategy based on functional specialization improves performance in both multi-task learning and transfer learning without adding any parameters.
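
The head-selection step shared by IAP and IAT can be illustrated with a minimal Python sketch. It is not the authors' implementation. The importance scores are made-up numbers, since this summary does not say how the paper computes them; the function names (`top_heads`, `prune_mask`, `train_mask`, `jaccard`) and the value of `K` are hypothetical; and the Jaccard overlap is only an illustrative proxy for how separated two tasks' important heads are, not the paper's specialization metric.

```python
from typing import Dict, List, Set


def top_heads(scores: List[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring heads (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda h: (-scores[h], h))
    return set(order[:k])


def prune_mask(scores: List[float], k: int) -> List[int]:
    """IAP-style mask: 0 disables a task's top-k heads, 1 keeps the rest."""
    important = top_heads(scores, k)
    return [0 if h in important else 1 for h in range(len(scores))]


def train_mask(scores: List[float], k: int) -> List[int]:
    """IAT-style mask: 1 marks heads that would be updated for this task."""
    important = top_heads(scores, k)
    return [1 if h in important else 0 for h in range(len(scores))]


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two head sets; 0.0 means fully disjoint (assumed proxy)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical per-head importance scores for two tasks (6 heads each).
importance: Dict[str, List[float]] = {
    "task_a": [0.9, 0.1, 0.8, 0.2, 0.05, 0.3],
    "task_b": [0.1, 0.7, 0.2, 0.9, 0.15, 0.05],
}
K = 2  # number of "most important" heads per task (assumed)

masks_prune = {task: prune_mask(s, K) for task, s in importance.items()}
masks_train = {task: train_mask(s, K) for task, s in importance.items()}
overlap = jaccard(
    top_heads(importance["task_a"], K),
    top_heads(importance["task_b"], K),
)

print(masks_prune)
print(masks_train)
print(overlap)
```

In a real model, the pruning mask would be applied to the attention heads before evaluating each task, and the training mask would restrict which heads receive gradient updates for that task. Both steps depend on the framework in use and are left out of this sketch.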