Abstract: With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.

- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences.
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
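The claim in (b), that attention approximates the conditional mean of the value given the key with an error that shrinks as the input grows, can be illustrated with a toy numerical sketch. This is not the paper's construction: the unit-norm keys, the target function `sin`, the noise level, the `scale` parameter, and the helper names `softmax_attention` and `mean_abs_error` are all assumptions chosen for illustration. With unit-norm keys, softmax weights on inner products behave like a kernel-regression average of the values around the query.

```python
import numpy as np


def softmax_attention(query, keys, values, scale=20.0):
    """Single-query softmax attention: sum_i softmax(scale * <q, k_i>) * v_i."""
    logits = scale * (keys @ query)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ values


def mean_abs_error(n_tokens, n_trials=200, seed=0):
    """Average gap between the attention output and the true E[value | key = query].

    Toy setup (an assumption, not taken from the paper): keys lie on the unit
    circle at a random angle, and each value is sin(angle) plus Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    query_angle = 1.0
    query = np.array([np.cos(query_angle), np.sin(query_angle)])
    target = np.sin(query_angle)  # true conditional mean at the query
    errors = []
    for _ in range(n_trials):
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_tokens)
        keys = np.stack([np.cos(angles), np.sin(angles)], axis=1)
        values = np.sin(angles) + 0.5 * rng.standard_normal(n_tokens)
        errors.append(abs(softmax_attention(query, keys, values) - target))
    return float(np.mean(errors))


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, mean_abs_error(n))
```

Running the script prints the average error for three input sizes. In this toy setting the error is noticeably smaller for longer inputs, although a fixed `scale` leaves a small residual smoothing bias that does not vanish.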

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models
