Abstract: The attention mechanism has recently boosted the performance of many neural network models in Natural Language Processing (NLP). Among the various attention mechanisms, Multi-Head Attention (MHA) is a powerful and popular variant. MHA, an essential component of the Transformer, helps the model attend to different feature subspaces independently. Despite its success, we conjecture that the different heads of existing MHA may not collaborate properly. To validate this conjecture and further improve the performance of the Transformer, we study the collaboration problem for MHA in this paper. First, we propose the Single-Layer Collaboration (SLC) mechanism, which helps each attention head improve its attention distribution based on feedback from the other heads. We then extend SLC to the cross-layer Multi-Head Dense Collaboration (MHDC) mechanism, which helps each MHA layer learn its attention distributions using knowledge from the other MHA layers. Both SLC and MHDC are implemented as lightweight modules with very few additional parameters. When equipped with these modules, our new framework, the Collaborative TransFormer (CollFormer), significantly outperforms the vanilla Transformer on a range of NLP tasks, including machine translation, sentence semantic relatedness, natural language inference, sentence classification, and reading comprehension. We also carry out extensive quantitative experiments to analyze the properties of MHDC in different settings. The experimental results validate the effectiveness and universality of MHDC and CollFormer.
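The abstract does not define SLC, so the following is only a minimal sketch of one way attention heads could share information within a single layer. It assumes a hypothetical learned `heads x heads` matrix `mix` that blends each head's attention logits with those of the other heads before the softmax. The function name `collaborative_attention`, the matrix `mix`, and all shapes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def collaborative_attention(q, k, v, mix):
    """Scaled dot-product attention with cross-head logit mixing.

    q, k, v: arrays of shape (heads, seq, d_head).
    mix:     hypothetical (heads, heads) mixing matrix (an assumption,
             not the paper's SLC definition). The identity matrix
             recovers standard multi-head attention.
    """
    d_head = q.shape[-1]
    # Per-head attention logits, shape (heads, seq, seq).
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Head g receives a weighted combination of the logits of every head h.
    mixed = np.einsum("gh,hij->gij", mix, logits)
    weights = softmax(mixed, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
heads, seq, d_head = 4, 5, 8
q, k, v = (rng.standard_normal((heads, seq, d_head)) for _ in range(3))

# Start near the identity so that each head mostly keeps its own logits.
mix = np.eye(heads) + 0.1 * rng.standard_normal((heads, heads))
out, weights = collaborative_attention(q, k, v, mix)
print(out.shape, weights.shape, mix.size)
```

In this sketch the only extra parameters are the `heads * heads` entries of `mix`, which is in the spirit of the "very few additional parameters" claim, although the actual SLC and MHDC modules may be parameterized differently.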
