Attention-Only Transformers and Implementing MLPs with Attention Heads

Robert Huben, Valerie Morris

DOI: https://doi.org/10.48550/arXiv.2309.08593

IF: 5.414

2023-09-15

Machine Learning

Abstract: The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.

What problem does this paper attempt to address?

The paper aims to show that attention heads can carry out the function of a multi-layer perceptron (MLP), so that a standard transformer containing both attention and MLP sublayers can be converted into a transformer that uses attention heads only. Specifically, the paper proves the following points:

1. **MLP neurons can be implemented with masked attention heads**: As long as the MLP's activation function belongs to a restricted class (which includes SiLU and close approximations of ReLU and GeLU), an MLP neuron can be implemented by a masked attention head with internal dimension 1. This makes it possible to convert an MLP-and-attention transformer into an attention-only transformer, at the cost of greatly increasing the number of attention heads.
2. **Attention heads can perform the components of an MLP separately**: The paper proves that attention heads can implement the linear transformations and the activation functions of an MLP as separate operations.
3. **Attention heads can encode arbitrary masking patterns in their weight matrices**: The paper also proves that, by adding specific terms to the weight matrices, an attention head can encode any masking pattern to within arbitrarily small error.

The significance of these results is that, in theory, an attention-only model can reproduce the behavior of a standard transformer (exactly for SiLU, and to within a close approximation for ReLU and GeLU). This offers a new perspective for mechanistic interpretability research on transformers. In particular, interpretability techniques that have worked well on attention heads could then be applied to what were originally MLP layers, supporting a more complete understanding of transformer models.
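The identities behind points 1 and 3 can be checked numerically. The following is a minimal sketch under simplifying assumptions, not the paper's exact construction: it treats a single scalar pre-activation `x`, and it assumes a hypothetical "bias token" whose attention score and value are both 0, so that the head attends only to the current token and that bias token. With scores `[x, 0]` and values `[x, 0]`, the softmax-weighted sum equals `x * sigmoid(x)`, which is SiLU. Rescaling by a factor `k` gives an approximation of ReLU, and subtracting a large constant `omega` from disallowed scores approximates a hard mask. All function names and the parameters `k` and `omega` are illustrative choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def silu(x):
    """Reference SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def attention_neuron(x):
    """One head, two positions: the token itself and an assumed bias token.

    Scores are [x, 0] and values are [x, 0], so the output is
    sigmoid(x) * x + (1 - sigmoid(x)) * 0 = SiLU(x).
    """
    weights = softmax([x, 0.0])
    values = [x, 0.0]
    return sum(w * v for w, v in zip(weights, values))


def relu_via_silu(x, k=100.0):
    """Approximate ReLU(x) by SiLU(k * x) / k; the gap shrinks as k grows."""
    return attention_neuron(k * x) / k


def hard_masked_row(scores, allowed):
    """Attention weights with disallowed positions removed exactly."""
    return softmax([s if a else float("-inf") for s, a in zip(scores, allowed)])


def soft_masked_row(scores, allowed, omega=50.0):
    """Attention weights with the mask folded into the scores.

    Subtracting a large constant omega from disallowed scores mimics a mask
    that lives in the weights rather than in a separate masking step.
    """
    return softmax([s if a else s - omega for s, a in zip(scores, allowed)])


if __name__ == "__main__":
    for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
        print(x, attention_neuron(x), silu(x), relu_via_silu(x))
```

In this sketch the head's query-key product and value are each a single scalar, which is one way to read "internal dimension 1". Implementing a full MLP layer this way would need one such head per neuron, which is consistent with the stated cost of greatly increasing the number of attention heads.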

MLP Can Be A Good Transformer Learner

Pay Attention to MLPs

Reducing the Transformer Architecture to a Minimum

A Primal-Dual Framework for Transformers and Neural Networks

Short-term Hebbian learning can implement transformer-like attention

Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages

Transformers are Universal In-context Learners

Representational Strengths and Limitations of Transformers

On the Role of Attention Masks and LayerNorm in Transformers

Agglomerative Attention

Generalized Probabilistic Attention Mechanism in Transformers

Attention as an RNN

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?

Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Mapping of attention mechanisms to a generalized Potts model

AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

What Matters in Transformers? Not All Attention is Needed

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking