Abstract: Transformers are extremely successful machine learning models whose mathematical properties remain poorly understood. Here, we rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity. By viewing such transformers as discrete-time dynamical systems describing the evolution of points in a Euclidean space, and thanks to a geometric interpretation of the self-attention mechanism based on hyperplane separation, we show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders. We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model, which effectively captures "context" by clustering meaningless words around leader words carrying the most meaning. Finally, we outline remaining challenges to bridge the gap between the mathematical analysis of transformers and their real-life implementation.

What problem does this paper attempt to address?

The paper seeks to understand the role of the self-attention mechanism in Transformer models, and in particular how a pure-attention hardmax Transformer behaves as the number of layers tends to infinity. The authors treat such a Transformer as a discrete-time dynamical system describing the evolution of points in Euclidean space and, using a geometric interpretation of self-attention based on hyperplane separation, show that the input vectors converge asymptotically to a clustered equilibrium determined by special points called leaders. They then use this result to build a fully interpretable Transformer model for sentiment analysis in natural language processing, which captures "context" by clustering meaningless words around the leader words that carry the most meaning.

### Main Contributions

1. **Theoretical Contribution**: The authors prove that, as the number of layers increases, the input vectors converge to a clustered equilibrium consisting of leaders or specific convex combinations of them. A leader is an input vector that, at some layer, attends only to itself. This result highlights the key role of self-attention in the Transformer model: it provides context through clustering and filters out unimportant words.
2. **Computational Contribution**: The authors apply the theoretical results to a practical task and construct an interpretable Transformer model for sentiment analysis. The model has three components: an encoder, Transformer layers, and a decoder. The encoder maps words to vectors in a high-dimensional space, the Transformer layers update the value of each vector, and the decoder predicts the sentiment of the text (positive or negative) from the final vector values. Experiments show that the clustering mechanism does provide context: grouping meaningless words around meaningful ones improves both the interpretability and the accuracy of the model.

### Related Work

- **Approximation-theory Perspective**: Other researchers have explained the success of Transformers from the perspective of approximation theory, arguing that they can approximate any continuous equivariant sequence-to-sequence function to arbitrary precision.
- **Dynamical-system Perspective**: Some studies regard the Transformer as an Euler discretization of neural ordinary differential equations (NODEs), or as a time-stepping scheme for ordinary differential equations describing particles subject to convection and diffusion. These continuous-time models make it easy to apply known tools from dynamical-systems analysis, but their connection to actual Transformers with discrete layers has yet to be established rigorously.

### Distinguishing Features

1. **Hardmax**: The authors use a hardmax form of self-attention instead of the commonly used softmax. Hardmax has a clearer geometric interpretation, which helps reveal the key role of leaders in the Transformer dynamics and avoids the metastability that may occur in softmax models.
2. **Discrete-time Framework**: The authors abandon the continuous-time framework and carry out the analysis directly in discrete time. This avoids the well-posedness problems caused by the non-smoothness of hardmax, but it also makes the proofs more difficult.
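The hardmax, discrete-time setting described above can be illustrated with a small sketch. This is not the authors' implementation: the plain inner product used as the attention score, the step size `alpha / (1 + alpha)`, and the helper names `hardmax_attention_step`, `is_leader`, and `run_layers` are all assumptions made for illustration. Each layer moves every point part of the way toward the average of the points that maximize its attention score, and a point that attends only to itself is treated as a leader.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product, used here as the (assumed) attention score."""
    return sum(a * b for a, b in zip(u, v))


def attended_indices(points: List[Vector], i: int) -> List[int]:
    """Indices j that maximize <z_i, z_j>: the hardmax attention set of point i."""
    scores = [dot(points[i], z_j) for z_j in points]
    best = max(scores)
    return [j for j, s in enumerate(scores) if s == best]


def is_leader(points: List[Vector], i: int) -> bool:
    """A point is treated as a leader here if it attends only to itself."""
    return attended_indices(points, i) == [i]


def hardmax_attention_step(points: List[Vector], alpha: float = 1.0) -> List[Vector]:
    """One layer: move each point toward the mean of the points it attends to."""
    step = alpha / (1.0 + alpha)  # assumed step size, strictly between 0 and 1
    updated = []
    for i, z_i in enumerate(points):
        attended = [points[j] for j in attended_indices(points, i)]
        target = [sum(coords) / len(attended) for coords in zip(*attended)]
        updated.append([a + step * (t - a) for a, t in zip(z_i, target)])
    return updated


def run_layers(points: List[Vector], num_layers: int, alpha: float = 1.0) -> List[Vector]:
    """Apply the layer map repeatedly, as a discrete-time dynamical system."""
    for _ in range(num_layers):
        points = hardmax_attention_step(points, alpha)
    return points


# Toy input: two large-norm points and two smaller points near them.
initial = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.5], [0.5, 1.0]]
leaders = [i for i in range(len(initial)) if is_leader(initial, i)]
final = run_layers(initial, num_layers=30)

print("leaders at layer 0:", leaders)
for start, end in zip(initial, final):
    print(start, "->", [round(c, 6) for c in end])
```

In this toy input the two large-norm points attend only to themselves and stay fixed, while the two smaller points drift toward them layer by layer, giving a two-cluster state in the spirit of the clustered equilibrium described above.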
### Conclusion

Through rigorous mathematical analysis, this paper reveals the clustering behavior of the self-attention mechanism in the Transformer model and its value in practical tasks such as sentiment analysis. These theoretical and experimental results not only deepen the understanding of the Transformer model, but also provide new ideas for designing more efficient and interpretable deep-learning models.
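The three-component sentiment model summarized above (encoder, Transformer layers, decoder) can be sketched in the same spirit. Everything concrete here is hypothetical: the two-dimensional `EMBEDDINGS` table, the `READOUT` vector, the mean-pooling decoder, and the function names are illustrative choices, not the trained model from the paper. The sketch only shows how filler words can collapse onto a high-norm "leader" word before the decoder reads out a label.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical embeddings: axis 0 leans positive, axis 1 leans negative.
# Sentiment words get a large norm; filler words get a small one.
EMBEDDINGS: Dict[str, Vector] = {
    "great": [2.0, 0.0],
    "awful": [0.0, 2.0],
    "the": [0.3, 0.2],
    "movie": [0.4, 0.3],
    "was": [0.2, 0.1],
}

# Hypothetical linear readout: positive score means positive sentiment.
READOUT: Vector = [1.0, -1.0]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def encode(text: str) -> List[Vector]:
    """Encoder: map each known word to its vector (unknown words are skipped)."""
    tokens = [list(EMBEDDINGS[w]) for w in text.lower().split() if w in EMBEDDINGS]
    if not tokens:
        raise ValueError("no known words in input")
    return tokens


def attention_layer(tokens: List[Vector], step: float = 0.5) -> List[Vector]:
    """Hardmax attention layer: move each token toward its best-scoring tokens."""
    new_tokens = []
    for z_i in tokens:
        scores = [dot(z_i, z_j) for z_j in tokens]
        best = max(scores)
        attended = [z_j for z_j, s in zip(tokens, scores) if s == best]
        target = [sum(coords) / len(attended) for coords in zip(*attended)]
        new_tokens.append([a + step * (t - a) for a, t in zip(z_i, target)])
    return new_tokens


def decode(tokens: List[Vector]) -> str:
    """Decoder: average the final tokens and apply the linear readout."""
    mean = [sum(coords) / len(tokens) for coords in zip(*tokens)]
    return "positive" if dot(READOUT, mean) > 0 else "negative"


def predict(text: str, num_layers: int = 30) -> str:
    tokens = encode(text)
    for _ in range(num_layers):
        tokens = attention_layer(tokens)
    return decode(tokens)


print(predict("the movie was great"))
print(predict("the movie was awful"))
```

With these toy vectors the filler words "the", "movie", and "was" all attend to the single sentiment word in the sentence, so after enough layers the averaged representation is dominated by that leader.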

Clustering in pure-attention hardmax transformers and its role in sentiment analysis

ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

Centroid Transformers: Learning to Abstract with Attention

A Primal-Dual Framework for Transformers and Neural Networks

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Transformers as Support Vector Machines

Representational Strengths and Limitations of Transformers

CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Interpreting Transformers for Jet Tagging

Transformers are Universal In-context Learners

Cluster-Former: Clustering-based Sparse Transformer for Question Answering

On the Role of Attention Masks and LayerNorm in Transformers

The Power of Hard Attention Transformers on Data Sequences: A Formal Language Theoretic Perspective

Attribute graph clustering via transformer and graph attention autoencoder

Clustering in Causal Attention Masking

How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

Agent Attention: On the Integration of Softmax and Linear Attention

QClusformer: A Quantum Transformer-based Framework for Unsupervised Visual Clustering

A mathematical perspective on Transformers

Attention as a Hypernetwork