Abstract: Effective representation learning from text has been an active area of research in NLP and text mining. Attention mechanisms have been at the forefront of learning contextual sentence representations. Current state-of-the-art approaches for many NLP tasks use large pre-trained language models such as BERT and XLNet to learn representations. These models are based on the Transformer architecture, which stacks repeated blocks of computation consisting of multi-head self-attention and feedforward networks. A major contributor to the computational complexity of Transformer models is the self-attention layer, which is both computationally expensive and parameter intensive. In this work, we introduce a novel multi-head self-attention mechanism operating on GRUs that is shown to be computationally cheaper and more parameter efficient than the self-attention mechanism proposed in Transformers for text classification tasks. The efficiency of our approach stems mainly from two optimizations: 1) we use low-rank matrix factorization of the affinity matrix to efficiently obtain multiple attention distributions instead of having separate parameters for each head, and 2) attention scores are obtained by querying a global context vector instead of densely querying all the words in the sentence. We evaluate the proposed model on sentiment analysis of movie reviews, prediction of business ratings from reviews, and classification of news articles into topics. We find that the proposed approach matches or outperforms a series of strong baselines and is more parameter efficient than comparable multi-head approaches. We also perform qualitative analyses to verify that the proposed approach is interpretable and captures context-dependent word importance.
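The two optimizations named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no equations, so the function name `low_rank_multihead_attention`, the factor matrices `P` and `Q`, the elementwise-product scoring rule, and all shapes are assumptions chosen for illustration. The hidden states `H` stand in for GRU outputs and are random here.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_multihead_attention(H, c, P, Q):
    """Hypothetical sketch of multi-head attention against one global context vector.

    H: (T, d) hidden states for T words (assumed to come from a GRU).
    c: (d,)   global context vector, shared by all words.
    P, Q: (d, m) low-rank factors; m is the number of heads.
    """
    ctx = c @ P            # (m,)   context projected once, not once per word pair
    words = H @ Q          # (T, m) every word projected into the same m-dim space
    scores = words * ctx   # (T, m) one score per word and head (assumed scoring rule)
    A = softmax(scores, axis=0)  # each column is a distribution over the T words
    S = A.T @ H            # (m, d) one attention-weighted sentence vector per head
    return A, S

# Illustrative shapes only; the values are random, not trained parameters.
rng = np.random.default_rng(0)
T, d, m = 6, 8, 3
H = rng.standard_normal((T, d))
c = rng.standard_normal(d)
P = rng.standard_normal((d, m))
Q = rng.standard_normal((d, m))
A, S = low_rank_multihead_attention(H, c, P, Q)
```

Under these assumptions the attention step has 2·d·m parameters shared across all heads and computes T·m scores per sentence, whereas dense self-attention scores all T·T word pairs with separate projections per head.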

Low-Rank and Locality Constrained Self-Attention for Sequence Modeling

Low Rank Factorization for Compact Multi-Head Self-Attention

Modeling Localness for Self-Attention Networks

Mechanics of Next Token Prediction with Self-Attention

Local Information Modeling with Self-Attention for Speaker Verification

Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

What Limits the Performance of Local Self-attention?

Character-Level Translation with Self-attention

Linear Log-Normal Attention with Unbiased Concentration

Low-Resolution Self-Attention for Semantic Segmentation

Multi-Scale Self-Attention for Text Classification

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

Self attention mechanism of bidirectional information enhancement

Structured Self-Attention Weights Encode Semantics in Sentiment Analysis

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory

Attention Is Not All You Need Anymore

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Hiformer: Sequence Modeling Networks with Hierarchical Attention Mechanisms

Joint Source-Target Self Attention with Locality Constraints

Local Slot Attention for Vision-and-Language Navigation