Abstract: Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$, $P/N=\mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different 'attention paths', defined as information pathways through different attention heads across layers. The kernels are weighted according to a 'task-relevant kernel combination' mechanism that aligns the total kernel with the task labels. As a consequence, this interplay between attention paths enhances generalization performance. Experiments confirm our findings on both synthetic and real-world sequence classification tasks. Finally, our theory explicitly relates the kernel combination mechanism to properties of the learned weights, allowing for a qualitative transfer of its insights to models trained via gradient descent. As an illustration, we demonstrate an efficient size reduction of the network by pruning those attention heads that are deemed less relevant by our theory.

What problem does this paper attempt to address?

This paper studies how attention paths interact in Transformer-like models, using statistical mechanics to analyze the learning behavior of a deep multi-head self-attention network. The authors construct an analytically tractable model closely related to the Transformer and derive exact equations for the predictor statistics of Bayesian learning in the finite-width thermodynamic limit, where the network width N and the number of training examples P both tend to infinity while the ratio P/N stays of order one. They find that the predictor statistics can be written as a sum of independent kernels, each pairing two attention paths, where an attention path is an information pathway through a particular choice of attention head in each layer. The kernels are weighted by a task-relevant kernel combination mechanism that aligns the total kernel with the task labels, and this interplay between attention paths is what improves generalization. The paper also makes the mechanism interpretable by relating it directly to the magnitude and correlation of the learned weights, which allows its insights to be transferred qualitatively to models trained with gradient descent. As an application, the authors show that the network can be reduced in size efficiently by pruning the attention heads that the theory identifies as less relevant. Experiments on synthetic data and real-world sequence classification tasks confirm these findings and highlight two benefits of kernel combination: task-relevant weighting of the kernels and the use of correlations between attention paths. In short, the paper addresses the interpretability and generalization of Transformer models by analyzing the interplay of attention paths and identifying a mechanism that improves performance.
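The kernel combination described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the random `path_features`, the weighting matrix `U`, the normalizations, and the alignment score are all assumptions chosen for illustration. The sketch only shows how paths are enumerated (one head per layer), how a kernel is formed for each pair of paths, and how a weighted sum of those kernels yields a total kernel whose alignment with the labels can be measured.

```python
import itertools
import numpy as np


def attention_paths(num_heads, num_layers):
    """Enumerate paths: one head index chosen in each layer."""
    return list(itertools.product(range(num_heads), repeat=num_layers))


def combine_path_kernels(path_features, U):
    """Weighted sum of path-path kernels.

    path_features: array (n_paths, P, N0), hypothetical per-path features
                   for P examples (in a real model these would come from
                   the attention-transformed inputs).
    U:             array (n_paths, n_paths), hypothetical weights for each
                   pair of paths; symmetric U gives a symmetric kernel.
    """
    n_paths, _, n_features = path_features.shape
    # C[a, b] is the P x P kernel pairing path a with path b.
    C = np.einsum("apn,bqn->abpq", path_features, path_features) / n_features
    return np.einsum("ab,abpq->pq", U, C) / n_paths


def label_alignment(K, y):
    """Normalized alignment y^T K y / (||K||_F ||y||^2) of kernel and labels."""
    return float(y @ K @ y / (np.linalg.norm(K) * (y @ y)))


rng = np.random.default_rng(0)
paths = attention_paths(num_heads=2, num_layers=2)  # 2**2 = 4 paths
features = rng.standard_normal((len(paths), 6, 8))  # toy data, P=6, N0=8
y = rng.choice([-1.0, 1.0], size=6)

# Identity weights: only same-path kernels contribute, all equally.
K_uniform = combine_path_kernels(features, np.eye(len(paths)))
print("paths:", paths)
print("alignment with identity weights:", label_alignment(K_uniform, y))
```

Replacing the identity matrix with a different symmetric `U` changes which path pairs dominate the total kernel, which is the degree of freedom the task-relevant weighting is described as exploiting.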

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Dynamical Mean-Field Theory of Self-Attention Neural Networks

Generalized Probabilistic Attention Mechanism in Transformers

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Improving Transformers with Probabilistic Attention Keys

The Asymptotic Behavior of Attention in Transformers

How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Attention as a Hypernetwork

Transformers on Markov Data: Constant Depth Suffices

Transformers are Universal In-context Learners

A Primal-Dual Framework for Transformers and Neural Networks

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Representational Strengths and Limitations of Transformers

Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond

Self-attention as an attractor network: transient memories without backpropagation

Are queries and keys always relevant? A case study on Transformer wave functions

Memorization Capacity of Multi-Head Attention in Transformers

Mapping of attention mechanisms to a generalized Potts model

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

$k$NN Attention Demystified: A Theoretical Exploration for Scalable Transformers