Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Lorenzo Tiberi, Francesca Mignacco, Kazuki Irie, Haim Sompolinsky
2024-05-25
Abstract: Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network, that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$, $P/N=\mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different 'attention paths', defined as information pathways through different attention heads across layers. The kernels are weighted according to a 'task-relevant kernel combination' mechanism that aligns the total kernel with the task labels. As a consequence, this interplay between attention paths enhances generalization performance. Experiments confirm our findings on both synthetic and real-world sequence classification tasks. Finally, our theory explicitly relates the kernel combination mechanism to properties of the learned weights, allowing for a qualitative transfer of its insights to models trained via gradient descent. As an illustration, we demonstrate an efficient size reduction of the network, by pruning those attention heads that are deemed less relevant by our theory.
Machine Learning, Disordered Systems and Neural Networks, Statistical Mechanics
What problem does this paper attempt to address?
This paper studies how attention paths interact in Transformer-like models, analyzing the learning behavior of a deep multi-head self-attention network within a statistical mechanics framework. The authors construct an analytically tractable model closely related to the Transformer and derive exact equations for the predictor statistics of Bayesian learning in the finite-width thermodynamic limit, where the network width $N$ and the number of training examples $P$ both tend to infinity while the ratio $P/N$ stays of order one. They find that the predictor statistics can be written as a weighted sum of independent kernels, each pairing different attention paths, where an attention path is an information pathway through particular attention heads across layers. The kernel weights are set by a task-relevant kernel combination mechanism that aligns the total kernel with the task labels, and this interplay between attention paths improves generalization. The theory also ties the mechanism directly to the magnitude and correlation of the learned weights, which allows its insights to be transferred qualitatively to models trained with gradient descent. As an application, the authors reduce the network size efficiently by pruning the attention heads that the theory deems less relevant. Experiments on synthetic and real-world sequence classification tasks confirm these findings and highlight two benefits of kernel combination: task-relevant weighting of the kernels and the exploitation of correlations between attention paths. Overall, the paper addresses the interpretability and generalization of Transformer models by dissecting how attention paths interact and identifying the mechanism through which that interaction improves performance.
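The idea of weighting path-path kernels so that the total kernel aligns with the labels can be illustrated with a toy NumPy sketch. This is not the paper's derivation or its equations: the per-path features `X`, the weight matrix `U`, the alignment score, and the heuristic used to build `U_task` are all hypothetical stand-ins chosen for illustration, and the sizes `P`, `d`, and `n_paths` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d, n_paths = 60, 30, 4  # toy sizes (assumed, not taken from the paper)

# Binary labels in {-1, +1}.
y = rng.choice([-1.0, 1.0], size=P)

# Hypothetical per-path features: path 0 carries label signal, the others are pure noise.
X = rng.standard_normal((n_paths, P, d))
X[0] = y[:, None] * np.ones(d) + 0.5 * X[0]

# Path-path kernels: K_paths[p, q] = X_p X_q^T / d, shape (n_paths, n_paths, P, P).
K_paths = np.einsum("pid,qjd->pqij", X, X) / d


def combine(U, K_paths):
    """Total kernel as a weighted sum over pairs of paths: sum_{p,q} U[p, q] * K_paths[p, q]."""
    return np.einsum("pq,pqij->ij", U, K_paths)


def alignment(K, y):
    """Kernel-target alignment: normalized Frobenius inner product of K with y y^T."""
    Y = np.outer(y, y)
    return float(np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y)))


# Baseline: equal weight on each same-path kernel, no cross-path terms.
U_uniform = np.eye(n_paths) / n_paths

# Illustrative heuristic (an assumption, not the theory's solution): weight each
# pair of paths by how strongly its kernel overlaps with the labels.
# M is a Gram matrix of the vectors X_p^T y, hence symmetric positive semidefinite.
M = np.einsum("i,pqij,j->pq", y, K_paths, y) / P**2
U_task = M / np.trace(M)

K_uniform = combine(U_uniform, K_paths)
K_task = combine(U_task, K_paths)

align_uniform = alignment(K_uniform, y)
align_task = alignment(K_task, y)
print("alignment (uniform weights):      ", align_uniform)
print("alignment (task-relevant weights):", align_task)

# Crude per-path relevance score from the diagonal of the weight matrix.
path_relevance = np.diag(U_task)
print("per-path relevance:", path_relevance)
```

In this toy setting the diagonal of `U_task` acts as a rough relevance score per path, which is the kind of quantity one could imagine thresholding to decide what to prune. The paper's actual pruning criterion is defined on attention heads through its own theory and is not reproduced here.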