Abstract: We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.

What problem does this paper attempt to address?

This paper explores the expressive power of the Transformer in sequence modeling and the working mechanisms of its components. The Transformer has achieved significant success in domains such as natural language processing, computer vision, and protein folding, but its internal mechanisms and theoretical foundations are still poorly understood. The authors systematically study how key hyperparameters, such as the number of layers, the number of attention heads, and the feed-forward network width, affect its performance, and analyze how the components affect its expressive power individually and in combination. They propose three sequence modeling tasks of differing complexity and establish explicit approximation rates for them in order to understand the underlying working principles of the Transformer. The main findings are:

1. Deeper Transformers can handle more complex memory relationships, while a single-layer Transformer is sufficient in some cases, especially when the relationships between memories are not very complex.
2. The attention layer and the feed-forward layer play different roles: the former extracts tokens from memory positions, and the latter approximates the non-linear memory functions and readout functions.
3. For relatively simple tasks, dot-product attention (DP) is not necessary. For more complex tasks, the cooperation between DP and relative positional encoding (RPE) is crucial for extracting adaptive memories, and the non-linearity of DP is necessary, although more efficient alternatives exist.
4. RPE is effective in modeling long-range dependencies, especially heavy-tailed memories, and overcomes the "memory curse" faced by recurrent neural networks.

In addition, the paper verifies these theoretical insights through experiments, compares them with existing literature, and discusses the advantages and disadvantages of various RPE and DP structures. These findings provide natural suggestions for improving the Transformer architecture and offer new perspectives for understanding how the Transformer works.
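
As a rough illustration of the finding that attention extracts tokens from memory positions, and that a positional term alone can suffice for a simple fixed memory, the following minimal sketch implements causal single-head attention over a scalar sequence in plain Python. It is not the paper's construction: the names `rpe_attention` and `lag_bias`, the scalar tokens, and the sharply peaked bias are all assumptions made for illustration.

```python
import math

def rpe_attention(x, bias, query_key=None):
    """Causal single-head attention over a scalar sequence x (illustrative).

    score(t, s) = query_key(x[t], x[s]) + bias(t - s)   for s <= t
    With query_key=None the dot-product term is dropped and only the
    relative positional bias decides where attention goes.
    """
    out = []
    for t in range(len(x)):
        scores = [
            (query_key(x[t], x[s]) if query_key else 0.0) + bias(t - s)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(v - m) for v in scores]
        z = sum(weights)
        out.append(sum(w * x[s] for s, w in enumerate(weights)) / z)
    return out

def lag_bias(lag, sharpness=50.0):
    """Hypothetical relative positional bias peaked at a fixed lag."""
    return lambda d: -sharpness * abs(d - lag)

x = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0]
y = rpe_attention(x, lag_bias(2))  # positional term only, no dot product
```

Here, for every position `t >= 2`, the output `y[t]` is numerically very close to `x[t - 2]`: the positional bias alone retrieves the token at a fixed memory location. Passing a `query_key` function adds an input-dependent score, which is the kind of term needed when the memory location depends on the input itself.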

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Approximation Rate of the Transformer Architecture for Sequence Modeling

An Intrinsic Dimension Perspective of Transformers for Sequential Modeling

On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View.

Modeling Graph Structure in Transformer for Better AMR-to-Text Generation

EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention

Transformer Acceleration with Dynamic Sparse Attention

Dynamic Evaluation of Transformer Language Models

How Transformers Implement Induction Heads: Approximation and Optimization Analysis

Analyzing Transformer Dynamics as Movement through Embedding Space

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

Transformers are Expressive, But Are They Expressive Enough for Regression?

Toeplitz Neural Network for Sequence Modeling

Enhancing Transformer-based models for Long Sequence Time Series Forecasting via Structured Matrix

Your Transformer May Not be as Powerful as You Expect

A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing