Abstract: Pre-trained language models (PLMs) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks. As the core part of PLMs, multi-head self-attention is appealing for its ability to jointly attend to information from different positions. However, researchers have found that PLMs often exhibit fixed attention patterns regardless of the input (e.g., paying excessive attention to [CLS] or [SEP]), which we argue might neglect important information in the other positions. In this work, we propose a simple yet effective attention guiding mechanism to improve the performance of PLMs by encouraging attention towards the established goals. Specifically, we propose two kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG). The former explicitly encourages diversity among multiple self-attention heads so that they jointly attend to information from different representation subspaces, while the latter encourages self-attention to attend to as many different positions of the input as possible. We conduct experiments with multiple general pre-trained models (i.e., BERT, ALBERT, and RoBERTa) and domain-specific pre-trained models (i.e., BioBERT, ClinicalBERT, BlueBERT, and SciBERT) on three benchmark datasets (i.e., MultiNLI, MedNLI, and Cross-genre-IR). Extensive experimental results demonstrate that our proposed MDG and PDG bring stable performance improvements on all datasets with high efficiency and low cost.
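
The abstract names two goals for attention guiding (diverse heads, broad positional coverage) but gives no formulas. The sketch below only illustrates how such goals could be measured on a stack of attention maps. It is not the MDG or PDG loss from the paper: the functions `head_similarity_penalty` and `position_coverage_entropy`, the cosine-similarity penalty, and the entropy measure are all assumptions introduced here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def head_similarity_penalty(attn):
    """Hypothetical diversity measure (not the paper's MDG loss).

    attn: array of shape (heads, seq, seq), each row a probability distribution.
    Returns the mean squared cosine similarity between distinct heads:
    1.0 when all heads are identical, 0.0 when they are mutually orthogonal.
    """
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    gram = flat @ flat.T                      # (heads, heads) cosine similarities
    off_diag = gram[~np.eye(h, dtype=bool)]   # drop each head's self-similarity
    return float((off_diag ** 2).mean())


def position_coverage_entropy(attn):
    """Hypothetical coverage measure (not the paper's PDG loss).

    Averages the attention each key position receives over heads and queries,
    then returns the entropy of that distribution. Higher means attention is
    spread over more positions; log(seq) is the maximum.
    """
    received = attn.mean(axis=(0, 1))         # (seq,), sums to 1
    return float(-(received * np.log(received + 1e-12)).sum())


# Example: random attention maps for 4 heads over 6 tokens.
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(4, 6, 6)))
penalty = head_similarity_penalty(attn)
coverage = position_coverage_entropy(attn)
```

Under these assumptions, a guiding term could add `penalty` to the task loss and subtract `coverage`, pushing heads apart while discouraging attention from collapsing onto a few tokens such as [CLS] or [SEP].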

Self-Attention with Cross-Lingual Position Representation

Multiple Structural Priors Guided Self Attention Network for Language Understanding

Deps-SAN: Neural Machine Translation with Dependency-Scaled Self-Attention Network

Boosting Neural Machine Translation with Dependency-Scaled Self-Attention Network

Investigating Self-Attention Network for Chinese Word Segmentation

How Does Selective Mechanism Improve Self-Attention Networks?

DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding

Translation as Cross-Domain Knowledge - Attention Augmentation for Unsupervised Cross-Domain Segmenting and Labeling Tasks

Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment

Learning Multilingual Representation for Natural Language Understanding with Enhanced Cross-Lingual Supervision

Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation and Understanding

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition

Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks

SG-Net: Syntax Guided Transformer for Language Representation

Phrase-level Self-Attention Networks for Universal Sentence Encoding

Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment

Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

CSMA-CNER: Multi-modal Chinese NER Task with Cross- and Self-Modality Attention

Neural Task Representations as Weak Supervision for Model Agnostic Cross-Lingual Transfer

Context-Aware Cross-Attention for Non-Autoregressive Translation