Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Mingze Wang, Weinan E
2024-07-03
Abstract: We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.
Machine Learning
What problem does this paper attempt to address?
This paper studies the expressive power of the Transformer in sequence modeling and the mechanisms by which its components work. The Transformer has achieved significant success in domains such as natural language processing, computer vision, and protein folding, but its internal mechanisms and theoretical foundations remain poorly understood. The authors systematically study how key hyperparameters, such as the number of layers, the number of attention heads, and the feed-forward network width, affect the model's expressive power, and how the components contribute individually and in combination. They propose three sequence modeling tasks of differing complexity and establish explicit approximation rates for each in order to understand the Transformer's underlying working principles. The main findings are:
1. Deeper Transformers can handle more complex memory relationships, while a single-layer Transformer is sufficient in some cases, especially when the relationships between memories are not very complex.
2. The attention layer and the feed-forward layer play different roles: the former extracts tokens from memory positions, and the latter approximates the non-linear memory functions and readout functions.
3. For relatively simple tasks, dot-product attention (DP) is not necessary. For more complex tasks, the cooperation between DP and relative positional encoding (RPE) is crucial for extracting adaptive memories, and the nonlinearity in DP is necessary, although more efficient alternatives exist.
4. RPE is effective at modeling long-range dependencies, especially heavy-tailed memories, and overcomes the "curse of memory" faced by recurrent neural networks.
In addition, the paper verifies these theoretical insights through experiments, compares them with existing literature, and discusses the advantages and disadvantages of various RPE and DP structures. These findings provide natural suggestions for improving the Transformer architecture and offer new perspectives for understanding how the Transformer works.
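The claim that attention can extract a token from a memory position without dot products can be illustrated with a small NumPy sketch. It is not the paper's construction: the function name `rpe_only_attention`, the bias shape `-sharpness * |t - s - lag|`, and the `sharpness` parameter are all hypothetical choices made for illustration. The attention logits depend only on the relative position t - s, so no query-key dot product is involved, and a sharply peaked bias makes the softmax select the token `lag` steps in the past.

```python
import numpy as np

def rpe_only_attention(x, lag, sharpness=50.0):
    """Causal attention whose logits depend only on relative position.

    x: array of shape (seq_len, dim).
    lag: the fixed memory offset to extract (hypothetical parameter).
    Returns an array of shape (seq_len, dim) where row t is a
    softmax-weighted average of x[0..t].
    """
    seq_len = x.shape[0]
    t = np.arange(seq_len)[:, None]   # query positions
    s = np.arange(seq_len)[None, :]   # key positions
    rel = t - s                       # relative position t - s

    # Illustrative positional bias: largest when t - s == lag.
    logits = -sharpness * np.abs(rel - lag).astype(float)
    # Causal mask: position t may only attend to s <= t.
    logits = np.where(rel >= 0, logits, -np.inf)

    # Numerically stable softmax over key positions.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
out = rpe_only_attention(x, lag=2)
# For t >= 2, out[t] is numerically indistinguishable from x[t - 2].
print(np.allclose(out[2:], x[:-2], atol=1e-8))
```

In this sketch the attention head only retrieves the token; any non-linear function of the retrieved token would be left to a feed-forward layer, matching the division of roles described in the summary. When the memory position depends on the input itself rather than being a fixed offset, a purely positional bias like this one no longer suffices, which is where the summary says DP and RPE must cooperate.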