Attention-Only Transformers and Implementing MLPs with Attention Heads

Robert Huben, Valerie Morris
DOI: https://doi.org/10.48550/arXiv.2309.08593
IF: 5.414
2023-09-15
Machine Learning
Abstract: The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
What problem does this paper attempt to address?
This paper shows how attention heads can reproduce the function of a multilayer perceptron (MLP), so that a conventional transformer with both attention and MLP sublayers can be converted into a transformer that uses only attention. Specifically, the paper proves the following points:

1. **MLP neurons can be implemented with masked attention heads**: As long as the MLP's activation function belongs to a restricted class (including SiLU and close approximations of ReLU and GeLU), an MLP neuron can be implemented by a masked attention head with internal dimension 1. This makes it possible to convert an MLP-and-attention transformer into an attention-only transformer, at the cost of greatly increasing the number of attention heads.
2. **Attention heads can perform the linear transformations and activation functions of an MLP separately**: Each component of an MLP, the linear transformation and the activation function, can be carried out by attention heads on its own.
3. **Attention heads can encode arbitrary masking patterns in their weight matrices**: By adding specific terms to the weight matrices, an attention head can reproduce any masking pattern to within arbitrarily small error.

The significance of these results is that, in theory, an attention-only model can match the behavior of a conventional transformer, which offers a new perspective on the mechanistic interpretability of transformer models. In particular, interpretation techniques that have been successful on attention heads could then be applied to what were originally MLP layers, supporting a more complete understanding of transformer models.
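The first result can be illustrated with a small numerical sketch. It is not the authors' exact construction; it only shows why a head with internal dimension 1 can compute a SiLU neuron. It relies on three assumptions that are not spelled out in the summary above: every ordinary token carries a constant 1 in its last coordinate, position 0 holds an all-zero "sink" token, and the mask lets each token attend only to itself and to the sink. Under those assumptions the softmax over the two scores (0 for the sink, w_in · x for the token itself) equals sigmoid(w_in · x), and multiplying by the value w_in · x gives SiLU(w_in · x). The names `mlp_neuron` and `attention_neuron` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def silu(z):
    # SiLU(z) = z * sigmoid(z)
    return z / (1.0 + math.exp(-z))


def mlp_neuron(x, w_in, w_out):
    """Reference: one MLP neuron, SiLU(w_in . x) scaled along w_out."""
    a = silu(dot(w_in, x))
    return [a * w for w in w_out]


def attention_neuron(tokens, w_in, w_out):
    """One masked attention head with internal dimension 1.

    Assumptions (illustrative, not taken from the source):
      * tokens[0] is an all-zero sink token,
      * every other token has a constant 1 in its last coordinate,
      * the mask allows token i to attend only to {0, i}.
    """
    d = len(w_in)
    w_q = w_in                          # query: scalar pre-activation w_in . x_i
    w_k = [0.0] * (d - 1) + [1.0]       # key: reads the constant coordinate
    w_v = w_in                          # value: the same scalar pre-activation
    outputs = []
    for i, x_i in enumerate(tokens):
        allowed = sorted({0, i})        # masking pattern: sink and self only
        q = dot(x_i, w_q)
        scores = [q * dot(tokens[j], w_k) for j in allowed]
        top = max(scores)               # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        mixed = sum(e / total * dot(tokens[j], w_v)
                    for e, j in zip(exps, allowed))
        outputs.append([mixed * w for w in w_out])  # output map: 1 -> d
    return outputs


# Small demonstration with made-up numbers.
demo_tokens = [[0.0, 0.0, 0.0], [0.5, -1.0, 1.0], [2.0, 0.3, 1.0]]
demo_w_in = [1.5, -0.7, 0.2]    # last entry acts as the neuron's bias
demo_w_out = [0.4, -2.0, 0.0]   # last entry 0 leaves the constant coordinate alone
head_out = attention_neuron(demo_tokens, demo_w_in, demo_w_out)
max_gap = max(
    abs(a - b)
    for x, y in zip(demo_tokens, head_out)
    for a, b in zip(mlp_neuron(x, demo_w_in, demo_w_out), y)
)
print("largest difference between head and neuron:", max_gap)
```

Because each neuron needs its own head in this sketch, an MLP layer with many hidden neurons would turn into equally many heads, which is consistent with the abstract's remark that the conversion greatly increases the number of attention heads.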