Abstract: The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning -- to the extent that Transformers are often accompanied by adaptive optimizers, layer normalization, learning rate warmup, and more, in comparison to MLPs/CNNs. The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures -- grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer's Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so further highlight the important structural differences to the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to explain the fundamental differences between the Transformer architecture and classical deep learning architectures, such as the multi-layer perceptron (MLP) and the convolutional neural network (CNN), through a theoretical analysis of the loss Hessian. Specifically, the paper focuses on the following points:

1. **Uniqueness of the Transformer**:
   - **Data Dependency**: Compared to classical architectures, the Transformer's dependence on the data is more complex and nonlinear.
   - **Nonlinear Dependency on the Weight Matrices**: The Transformer's optimization process is strongly influenced by nonlinear effects of the data and weight matrices.
   - **Attention Mechanism**: Self-attention introduces unique structural characteristics; the paper examines how these characteristics affect optimization.

2. **Optimization Challenges**:
   - **Optimizer Choice**: Why do Transformers usually require adaptive optimizers (such as Adam)?
   - **Layer Normalization**: Why do Transformers need layer normalization?
   - **Learning Rate Warm-up**: Why do Transformers need learning rate warm-up?

3. **Structure of the Hessian Matrix**:
   - **Derivation of the Hessian Matrix**: For a single self-attention layer, derive the Transformer's Hessian and express it in terms of matrix derivatives.
   - **Dependency on Data, Weights, and Attention Moments**: Characterize how the Hessian depends on the data, the weight matrices, and the attention moments, and show that these highly nonlinear dependencies vary heterogeneously across parameter groups.
   - **Comparison with Classical Architectures**: Compare the Transformer's Hessian with that of classical networks to reveal the structural differences.

Through these analyses, the paper aims to provide deeper insight into the Transformer's unique optimization landscape and the challenges it poses. This helps explain why Transformers are usually trained with specific techniques, and it offers a theoretical basis for further improving the architecture.
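The block structure discussed above can be explored numerically. The following is a minimal sketch, not the paper's analytical derivation: it assumes a single-head self-attention layer with square `d x d` weight matrices `W_Q`, `W_K`, `W_V`, no residual connection or layer normalization, a mean squared error loss, and toy sizes. It approximates the Hessian by central finite differences and reports the Frobenius norm of each parameter block. All function names (`attention_loss`, `fd_hessian`, `block_frobenius_norms`) are hypothetical and chosen for illustration.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out


def attention_loss(theta, X, Y, d):
    """MSE of F(X) = softmax(X W_Q (X W_K)^T / sqrt(d)) X W_V against Y.

    theta is the flat concatenation [vec(W_Q), vec(W_K), vec(W_V)], each d x d.
    """
    n = d * d
    W_Q, W_K, W_V = (
        [theta[k * n + i * d : k * n + (i + 1) * d] for i in range(d)]
        for k in range(3)
    )
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, K_T)]
    F = matmul(softmax_rows(scores), V)
    sq_err = sum((f - y) ** 2 for fr, yr in zip(F, Y) for f, y in zip(fr, yr))
    return sq_err / (len(X) * d)


def fd_hessian(f, theta, eps=1e-3):
    """Central finite-difference approximation of the Hessian of f at theta."""
    p = len(theta)
    H = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            def shifted(si, sj):
                t = list(theta)
                t[i] += si * eps
                t[j] += sj * eps
                return f(t)
            val = (shifted(1, 1) - shifted(1, -1)
                   - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps ** 2)
            H[i][j] = H[j][i] = val  # symmetric by construction
    return H


def block_frobenius_norms(H, d):
    """Frobenius norm of each (W_a, W_b) Hessian block, for a, b in {Q, K, V}."""
    n = d * d
    names = ["Q", "K", "V"]
    return {
        (a, b): math.sqrt(sum(H[i][j] ** 2
                              for i in range(ia * n, (ia + 1) * n)
                              for j in range(ib * n, (ib + 1) * n)))
        for ia, a in enumerate(names)
        for ib, b in enumerate(names)
    }


random.seed(0)
d, L = 2, 3  # toy embedding dimension and sequence length (assumed values)
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
Y = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
theta = [random.gauss(0, 1) for _ in range(3 * d * d)]

H = fd_hessian(lambda t: attention_loss(t, X, Y, d), theta)
for (a, b), norm in block_frobenius_norms(H, d).items():
    print(f"||H_{a}{b}||_F = {norm:.4f}")
```

In this toy setting the loss is quadratic in `W_V`, so the value-value block does not change when `W_V` changes, whereas the query and key blocks vanish entirely when `W_V` is zero. This is one simple way to see that different parameter groups can depend on the data and weights in different ways; it says nothing about the exact expressions derived in the paper.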

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

Transformers from an Optimization Perspective

Representational Strengths and Limitations of Transformers

Understanding the Difficulty of Training Transformers

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Demystify Transformers & Convolutions in Modern Image Deep Networks

What comes after transformers? -- A selective survey connecting ideas in deep learning

How Well Can Transformers Emulate In-context Newton's Method?

Reducing the Transformer Architecture to a Minimum

Transformers are Universal In-context Learners

How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

Linear attention is (maybe) all you need (to understand transformer optimization)

In-Context Convergence of Transformers

Unraveling the Gradient Descent Dynamics of Transformers

Transformers Learn Nonlinear Features In Context: Nonconvex Mean-field Dynamics on the Attention Landscape

The Attention Mechanism Demystified

Analyzing Transformers in Embedding Space

Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Transformers as Support Vector Machines