Abstract: Scaled Dot Product Attention (SDPA) is the backbone of many modern deep-learning models. It is so versatile that it has been used in natural language, vision, and multi-modal domains with very little change compared to its original formulation. This paper discusses why the current formulation is inefficient by delving into the mathematical details of the attention mechanism. We propose three improvements to mitigate these inefficiencies, thereby introducing three enhanced attention mechanisms: Optimised, Efficient, and Super Attention. Optimised and Efficient Attention have one and two fewer matrix multiplications per head, respectively, and 25% and 50% fewer parameters, respectively, than standard SDPA, but perform similarly to standard SDPA in both vision and natural language tasks. They can be used in all applications where SDPA is used while offering smaller model sizes and faster training and inference without noticeable loss in performance. Super Attention introduces a new linear transformation on the values, transforming them from the left. It outperforms standard SDPA on vision and natural language tasks by up to 17% while having one fewer matrix multiplication per head and 25% fewer parameters than standard SDPA. Consequently, it is also faster than standard SDPA. Super Attention is ideal in applications where the attention layer's context length is fixed, such as Vision Transformers. In addition to providing mathematical reasoning, we evaluate the presented attention mechanisms on several datasets, including MNIST, CIFAR100, ImageNet, IMDB Movie Reviews, and Amazon Reviews, as well as the combined Europarl and Anki English-Spanish datasets for neural machine translation.
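For orientation, the baseline and the parameter arithmetic quoted in the abstract can be sketched as follows. This is a minimal single-head NumPy sketch of standard SDPA, not the paper's implementation. It assumes the common formulation with four equally sized `d_model x d_model` projection matrices (`W_q`, `W_k`, `W_v`, `W_o`); under that assumption, dropping one or two of them yields the 25% and 50% reductions mentioned above. Which matrices the proposed variants actually remove is not stated in the abstract and is not modelled here. All names (`sdpa`, `projection_params`, `d_model`, `seq_len`) are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sdpa(X, W_q, W_k, W_v, W_o):
    """Standard single-head scaled dot-product attention (assumed baseline)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (seq_len, seq_len)
    return weights @ V @ W_o, weights


def projection_params(d_model, n_matrices):
    # Assumption: every projection is a square d_model x d_model matrix, no biases.
    return n_matrices * d_model * d_model


d_model, seq_len = 64, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out, weights = sdpa(X, W_q, W_k, W_v, W_o)

baseline = projection_params(d_model, 4)
saving_one = 1 - projection_params(d_model, 3) / baseline  # one matrix fewer
saving_two = 1 - projection_params(d_model, 2) / baseline  # two matrices fewer
print(out.shape, saving_one, saving_two)
```

The count ignores biases and any head-splitting details, so it illustrates the proportions rather than the exact sizes of the models evaluated in the paper.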

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

On the Benefits of Rank in Attention Layers

On the Role of Attention Masks and LayerNorm in Transformers

The Depth-to-Width Interplay in Self-Attention

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Are Sixteen Heads Really Better than One?

Implicit Bias and Fast Convergence Rates for Self-attention

How Smooth Is Attention?

MLP Can Be A Good Transformer Learner

Attention Is Not All You Need Anymore

Representational Strengths and Limitations of Transformers

Attention is All you Need

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism

Centered Self-Attention Layers

On the Relationship between Self-Attention and Convolutional Layers

On the Optimization and Generalization of Multi-head Attention

On the Expressive Power of Self-Attention Matrices

Self-attention as an attractor network: transient memories without backpropagation