Abstract: With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass?

We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.

- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences.
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
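The claim in (b), that attention approximates the conditional mean of the value given the key, can be illustrated with a small numerical sketch. This is not the paper's construction. It assumes a hypothetical toy setting: keys drawn uniformly on the unit circle, values equal to cos(2θ) of the key angle plus Gaussian noise, and a softmax temperature that shrinks as n^(-0.4) with the number of tokens n. The function names, the data model, and the temperature schedule are all illustrative choices. Under these assumptions, softmax attention acts as a kernel smoother over the keys, and its output at a query should move closer to the true conditional mean as the input grows.

```python
import numpy as np


def softmax_attention(query, keys, values, temperature):
    """Single-query softmax attention: a softmax-weighted average of the values."""
    scores = keys @ query / temperature
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ values


def true_conditional_mean(key):
    """Assumed ground truth E[v | k] = cos(2*theta) for k = (cos theta, sin theta)."""
    return key[0] ** 2 - key[1] ** 2


def sample_tokens(n, rng, noise=0.1):
    """Hypothetical toy data: keys on the unit circle, values = cos(2*theta) + noise."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    keys = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    values = np.cos(2.0 * angles) + noise * rng.standard_normal(n)
    return keys, values


def mean_abs_error(n, trials=20, seed=0):
    """Average |attention output - E[v | k = query]| over independent draws."""
    rng = np.random.default_rng(seed)
    query = np.array([1.0, 0.0])
    temperature = n ** -0.4  # assumed schedule: a sharper softmax for longer inputs
    errors = []
    for _ in range(trials):
        keys, values = sample_tokens(n, rng)
        estimate = softmax_attention(query, keys, values, temperature)
        errors.append(abs(estimate - true_conditional_mean(query)))
    return float(np.mean(errors))


if __name__ == "__main__":
    for n in (50, 500, 5000):
        print(n, mean_abs_error(n))
```

Because the keys have unit norm, the softmax weights depend only on the distance between key and query, so the attention output is a Nadaraya-Watson style estimate with a bandwidth set by the temperature. The shrinking temperature is what lets the error fall with n in this sketch. With a fixed temperature the smoothing bias would not vanish.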

Attending Via Both Fine-tuning and Compressing

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Why Attentions May Not Be Interpretable?

Loss-Based Attention for Interpreting Image-Level Prediction of Convolutional Neural Networks

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

ATCON: Attention Consistency for Vision Models

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

Enhancing Learned Image Compression via Cross Window-based Attention

Text Compression-aided Transformer Encoding

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Revisiting Attention Weights as Explanations from an Information Theoretic Perspective

Attention Interpretability Across NLP Tasks

Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets

FAM: Improving columnar vision transformer with feature attention mechanism

Learning When to Attend for Neural Machine Translation

Rethinking the role of attention mechanism: a causality perspective

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Shift-and-Balance Attention

COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models

Where is the Model Looking At?--Concentrate and Explain the Network Attention