Abstract: With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.
- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences.
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
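The claim in (b), that attention approximates the conditional mean of the value given the key, can be illustrated with a small numerical sketch. This is not the paper's construction. It relies on the standard observation that, for unit-norm keys and queries, softmax attention weights behave like a kernel smoother, so the attention output resembles a Nadaraya-Watson estimate of E[value | key = query]. The toy data model (keys on the unit circle, values equal to sin(2θ) plus Gaussian noise), the fixed temperature, and all function names are assumptions made purely for illustration.

```python
import math
import random


def softmax_attention(query, keys, values, temperature):
    """Single-query softmax attention over scalar values."""
    scores = [sum(q * k for q, k in zip(query, key)) / temperature for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def conditional_mean(theta):
    """Assumed ground truth E[value | key at angle theta] for the toy model."""
    return math.sin(2.0 * theta)


def mean_abs_error(num_tokens, num_trials=30, temperature=0.02, noise_std=0.5, seed=0):
    """Average |attention output - conditional mean| over random sequences."""
    rng = random.Random(seed)
    query_angle = math.pi / 4
    query = (math.cos(query_angle), math.sin(query_angle))
    target = conditional_mean(query_angle)
    total_error = 0.0
    for _ in range(num_trials):
        # Keys are unit vectors; values are noisy functions of the key angle.
        angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_tokens)]
        keys = [(math.cos(a), math.sin(a)) for a in angles]
        values = [conditional_mean(a) + rng.gauss(0.0, noise_std) for a in angles]
        output = softmax_attention(query, keys, values, temperature)
        total_error += abs(output - target)
    return total_error / num_trials


error_short = mean_abs_error(num_tokens=50)
error_long = mean_abs_error(num_tokens=5000)
print(f"mean abs error, 50 tokens:   {error_short:.3f}")
print(f"mean abs error, 5000 tokens: {error_long:.3f}")
```

In this toy setting the error shrinks as the sequence grows, which is consistent in spirit with an approximation error that decreases in input size. Because the temperature is held fixed here, a small smoothing bias remains even for long sequences; the sketch makes no attempt to reproduce the rates or the parameterization analyzed in the paper.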

Attention layers provably solve single-location regression

What can a Single Attention Layer Learn? A Study Through the Random Features Lens

Provably learning a multi-head attention layer

Latte: Latent Attention for Linear Time Transformers

AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Mechanics of Next Token Prediction with Self-Attention

Attention-Linear Trajectory Prediction

Low-Rank and Locality Constrained Self-Attention for Sequence Modeling

Superiority of Multi-Head Attention in In-Context Linear Regression

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers

Single Headed Attention RNN: Stop Thinking With Your Head

One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation

On the Expressive Power of Self-Attention Matrices

Attention as an RNN

In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick

The Closeness of In-Context Learning and Weight Shifting for Softmax Regression