Abstract: With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.

- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences.
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
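The claim in (b), that attention approximates the conditional mean of the value given the key, can be illustrated with a small numerical sketch. This is not the paper's construction or proof: it is a toy setup chosen for illustration, in which keys lie on the unit circle, values are a noisy function of the key, and a single softmax attention head with a hand-picked temperature acts as a kernel-smoothing estimator. The names `softmax_attention`, `conditional_mean`, and `attention_error`, and the constants for temperature, noise level, and query angle, are all assumptions of this sketch.

```python
import numpy as np


def softmax_attention(query, keys, values, temperature):
    """Single-query softmax attention: weights proportional to exp(<q, k_i> / temperature)."""
    scores = keys @ query / temperature
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ values


def conditional_mean(theta):
    """Toy ground truth E[value | key at angle theta]; an assumption of this sketch."""
    return np.sin(2.0 * theta)


def attention_error(n_tokens, rng, temperature=0.05, noise=0.5, theta0=0.7):
    """Absolute error of the attention output against the true conditional mean."""
    # Keys are i.i.d. points on the unit circle, so the tokens are exchangeable.
    theta = rng.uniform(-np.pi, np.pi, size=n_tokens)
    keys = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    # Values are the conditional mean plus independent Gaussian noise.
    values = conditional_mean(theta) + noise * rng.standard_normal(n_tokens)
    query = np.array([np.cos(theta0), np.sin(theta0)])
    estimate = softmax_attention(query, keys, values, temperature)
    return abs(estimate - conditional_mean(theta0))


rng = np.random.default_rng(0)
# Average the error over repeated draws for several input sizes.
errors = {
    n: float(np.mean([attention_error(n, rng) for _ in range(200)]))
    for n in (16, 256, 4096)
}
for n, err in errors.items():
    print(f"n_tokens={n:5d}  mean abs error={err:.3f}")
```

In this toy setting the average error shrinks as the number of tokens grows. Because the temperature is held fixed, a smoothing bias remains at large input sizes, so the sketch shows only the qualitative trend and not the rate established in the paper.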

Deriving Machine Attention from Human Rationales

Attention in Reasoning: Dataset, Analysis, and Modeling

Exploring Distantly-Labeled Rationales in Neural Network Models

What to Learn, and How: Toward Effective Learning from Rationales

AiR: Attention with Reasoning Capability

Understanding More about Human and Machine Attention in Deep Neural Networks

Data-Centric Human Preference Optimization with Rationales

Beyond Accuracy: Ensuring Correct Predictions With Correct Rationales

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations

Making a (Counterfactual) Difference One Rationale at a Time

Human Vs Machine Attention in Neural Networks: A Comparative Study.

Rethinking the role of attention mechanism: a causality perspective

Attention: Marginal Probability is All You Need?

Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension

An Introductory Survey on Attention Mechanisms in NLP Problems

Inferring Human Attention by Learning Latent Intentions.

Enhancing the Rationale-Input Alignment for Self-explaining Rationalization

Faithful Attention Explainer: Verbalizing Decisions Based on Discriminative Features

Attention Is (not) All You Need for Commonsense Reasoning

[Cesium-137 in soil and vegetation from Spitsbergen and continental Norway (Suldal) 1981].