Abstract: With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.
- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences.
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
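
The claim in (b), that attention approximates the conditional mean of the value given the key, can be illustrated with a small numerical sketch. The setup below is an assumption made for illustration and is not the paper's construction: keys are unit vectors on a circle, values equal the first key coordinate plus Gaussian noise (so the conditional mean of the value given the key is known), and `beta` is a hypothetical inverse-temperature parameter. The names `softmax_attention`, `sample_tokens`, and `attention_error` are likewise hypothetical.

```python
import math
import random


def softmax_attention(query, keys, values, beta=1.0):
    """Single-query softmax attention over scalar values."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def sample_tokens(n, rng, noise=0.1):
    """Draw n i.i.d. (hence exchangeable) key-value pairs.

    Assumed toy model: key = (cos t, sin t), value = key[0] + noise,
    so the conditional mean of the value given the key is key[0].
    """
    keys, values = [], []
    for _ in range(n):
        theta = rng.uniform(-math.pi, math.pi)
        key = (math.cos(theta), math.sin(theta))
        keys.append(key)
        values.append(key[0] + rng.gauss(0.0, noise))
    return keys, values


def attention_error(n, theta0=0.7, beta=50.0, seed=0):
    """Absolute gap between the attention output and the true conditional mean."""
    rng = random.Random(seed)
    keys, values = sample_tokens(n, rng)
    query = (math.cos(theta0), math.sin(theta0))
    estimate = softmax_attention(query, keys, values, beta)
    return abs(estimate - math.cos(theta0))


if __name__ == "__main__":
    # Print the error for increasing input sizes.
    for n in (10, 100, 1000, 10000):
        print(n, round(attention_error(n), 4))
```

In this toy model the keys have unit norm, so the softmax weights act like a smoothing kernel centered on the query, and the output is a weighted average of the values of nearby keys. With `beta` held fixed, a small smoothing bias remains even for long inputs; the sketch only shows the sampling error shrinking as more tokens are added. The output is also unchanged under any reordering of the key-value pairs, which mirrors the exchangeability assumption in the abstract.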

Structural analysis of an all-purpose question answering model

Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Pay More Attention - Neural Architectures for Question-Answering

Positional Attention Guided Transformer-Like Architecture for Visual Question Answering

Attention Can Reflect Syntactic Structure (If You Let It)

Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models

Syntax-informed Question Answering with Heterogeneous Graph Transformer

Multimodal Graph Transformer for Multimodal Question Answering

Enhancing Pre-trained Models with Text Structure Knowledge for Question Generation

Cluster-Former: Clustering-based Sparse Transformer for Question Answering

Attention as a Hypernetwork

A lightweight Transformer-based visual question answering network with Weight-Sharing Hybrid Attention

Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

A Step Closer to Comprehensive Answers: Constrained Multi-Stage Question Decomposition with Large Language Models

Self-Attention is All You Need

Task-driven Visual Saliency and Attention-based Visual Question Answering

Neural Abstractive Summarization with Structural Attention

Modular Blended Attention Network for Video Question Answering

You Only Need One Model for Open-domain Question Answering