Abstract: Transformer-based models, despite achieving super-human performance on several downstream tasks, are often regarded as black boxes and used as a whole. It is still unclear what mechanisms they have learned, especially in their core module: multi-head attention. Inspired by functional specialization in the human brain, which helps to handle multiple tasks efficiently, this work investigates whether the multi-head attention module evolves a similar separation of functions under multi-task training. If so, can this mechanism further improve model performance? To investigate these questions, we introduce an interpreting method to quantify the degree of functional specialization in multi-head attention. We further propose a simple multi-task training method to increase functional specialization and mitigate negative information transfer in multi-task learning. Experimental results on seven pre-trained transformer models demonstrate that multi-head attention does evolve functional specialization after multi-task training, and that this specialization is affected by the similarity of tasks. Moreover, the multi-task training strategy based on functional specialization boosts performance in both multi-task learning and transfer learning without adding any parameters.

What problem does this paper attempt to address?

This paper addresses two questions:

1. **Does multi-head attention evolve functional specialization under multi-task training?** Inspired by functional specialization in the human brain, the authors ask whether a similar separation of functions emerges in the multi-head attention module during multi-task training and, if it does, which factors affect its degree.
2. **How can functional specialization be used to improve model performance?** If multi-head attention does evolve functional specialization, how can it be exploited to further improve performance in multi-task learning and transfer learning?

To study these questions, the authors propose two methods:

- **Interpretation method (IAP)**: Important Attention-head Pruning (IAP) quantifies the degree of functional specialization in multi-head attention. It computes an importance score for each attention head on each task, then prunes the most important heads for each task to determine their impact on task performance.
- **Exploitation method (IAT)**: Important Attention-head Training (IAT) is a multi-task learning method that trains only the most important attention heads for each task. This promotes the separation of functions in the multi-head attention module, which alleviates negative information transfer in multi-task learning and improves model performance.

The experimental results show that multi-head attention does evolve functional specialization after multi-task training, and that this specialization is affected by task similarity. In addition, the multi-task training strategy based on functional specialization improves performance in both multi-task learning and transfer learning without adding any parameters.
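The head-selection step that IAP and IAT both rely on can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not say how importance scores are computed, so the sketch assumes they are already available as a task-by-head matrix, and the function names (`top_heads_per_task`, `pruning_mask`, `training_mask`) and the parameter `k` are hypothetical.

```python
import numpy as np


def top_heads_per_task(importance, k):
    """Select the k highest-scoring heads for each task.

    importance: array-like of shape (num_tasks, num_heads). The scores are
    assumed to be precomputed; how they are obtained is not covered here.
    Returns a boolean array of the same shape, True for selected heads.
    """
    importance = np.asarray(importance, dtype=float)
    num_tasks, num_heads = importance.shape
    if not 0 < k <= num_heads:
        raise ValueError("k must be between 1 and the number of heads")
    # Sort heads by descending score and keep the first k indices per task.
    top_idx = np.argsort(-importance, axis=1, kind="stable")[:, :k]
    selected = np.zeros((num_tasks, num_heads), dtype=bool)
    np.put_along_axis(selected, top_idx, True, axis=1)
    return selected


def pruning_mask(selected, task):
    """IAP-style mask (assumed form): 0.0 for the task's important heads,
    1.0 elsewhere, so multiplying head outputs by it prunes those heads."""
    return (~selected[task]).astype(float)


def training_mask(selected, task):
    """IAT-style mask (assumed form): 1.0 for the heads that would be
    updated when training on this task, 0.0 for all other heads."""
    return selected[task].astype(float)


# Hypothetical scores for 2 tasks and 4 heads.
scores = [[0.9, 0.1, 0.5, 0.2],
          [0.1, 0.8, 0.2, 0.7]]
selected = top_heads_per_task(scores, k=2)
print(pruning_mask(selected, task=0))
print(training_mask(selected, task=1))
```

In this toy example the two tasks select disjoint sets of heads, which is the kind of separation the summary describes as functional specialization. With real scores the sets may overlap, and the amount of overlap is one possible indicator of how strongly the heads have specialized.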

Interpreting and Exploiting Functional Specialization in Multi-Head Attention under Multi-task Learning

On the Optimization and Generalization of Multi-head Attention

How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks

Improved Transformer with Multi-Head Dense Collaboration

Metaformer: A Transformer That Tends to Mine Metaphorical-Level Information

Multi-head or Single-head? An Empirical Comparison for Transformer Training

MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning

Improving Transformers with Dynamically Composable Multi-Head Attention

How does Multi-Task Training Affect Transformer In-Context Capabilities? Investigations with Function Classes

Multi-Head Attention: Collaborate Instead of Concatenate

Superiority of Multi-Head Attention in In-Context Linear Regression

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

Disentangling Representations through Multi-task Learning

MoH: Multi-Head Attention as Mixture-of-Head Attention

Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning

Towards Understanding Multi-Task Learning (Generalization) of LLMs via Detecting and Exploring Task-Specific Neurons

Multi-Task Learning for Multiple Language Translation

Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras

Attention as a Hypernetwork