MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map

Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Ruijie Zhu, Yiran Zhong, Yu Qiao, Jibin Wu, Bo Xu, Guoqi Li
2024-11-16
Abstract: Various linear complexity models, such as Linear Transformer (LinFormer), State Space Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional softmax attention in Transformer structures. However, the optimal design of these linear models is still an open question. In this work, we attempt to answer this question by finding the best linear approximation to softmax attention from a theoretical perspective. We start by unifying existing linear complexity models as the linear attention form and then identify three conditions for the optimal linear attention design: 1) Dynamic memory ability; 2) Static approximation ability; 3) Least parameter approximation. We find that none of the current linear models meet all three conditions, resulting in suboptimal performance. Instead, we propose Meta Linear Attention (MetaLA) as a solution that satisfies these conditions. Our experiments on Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem that existing linear-complexity models (such as Linear Transformer, State Space Model, and Linear RNN) do not reach an optimal design when they replace the softmax attention mechanism in the standard Transformer structure. Although these models compute in linear time, their performance remains suboptimal.

#### Main problems:

1. **Limitations of existing linear models**:
   - Current linear-complexity models (such as LinFormer, SSM, LinRNN) reduce computational complexity, but none of them meets all three of the following conditions:
     - **Dynamic memory ability**: The ability to adaptively store important information and forget unimportant information when processing input sequences.
     - **Static approximation ability**: The ability to approximate any softmax attention map.
     - **Least parameter approximation**: Use as few parameters as possible while meeting the first two conditions.
2. **Design of optimal linear approximation**:
   - The paper proposes a unified linear attention form, abstracts existing linear models into this form, and defines the necessary conditions for achieving "optimal linear approximation".
   - Based on these conditions, the authors propose the Meta Linear Attention (MetaLA) model, which is designed to meet all three conditions.

#### Solutions:

- **MetaLA module**:
  - **Remove unnecessary Key matrices**: Theoretical analysis shows that the Key matrix is not necessary, so the model structure can be simplified.
  - **Self-enhancement mechanism**: Strengthen the attention of each token to itself to avoid attention dilution.
  - **Short convolution**: Introduce a short convolution to enhance local interaction.
- **Experimental verification**:
  - Experiments on multiple tasks, including Multi-Query Associative Recall (MQAR), language modeling, image classification, and long-sequence modeling on the Long-Range Arena (LRA) benchmark, show that MetaLA is more effective than existing linear models.

### Summary:

Existing linear-complexity models do not reach an optimal design when they replace softmax attention. The paper addresses this by proposing MetaLA, which is designed to satisfy all three conditions for optimal linear attention and outperforms existing linear models in the reported experiments.
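
The "unified linear attention form" and the idea of removing the Key matrix can be made concrete with a small sketch. The summary gives no equations, so everything below is an illustrative assumption rather than the authors' implementation: the recurrence `S_t = diag(alpha_t) S_{t-1} + k_t^T v_t`, `o_t = q_t S_t` is a common gated linear attention form, and the helper `tied_keys`, which replaces the key with `1 - alpha_t`, is a hypothetical way to drop a separate Key projection. The self-enhancement mechanism and the short convolution are not modeled. The sketch also builds the causal attention map implied by the recurrence, which is the object that would be compared against a softmax attention map.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def tied_keys(alpha: Matrix) -> Matrix:
    """Hypothetical key-free variant: derive the key from the decay gate."""
    return [[1.0 - a for a in a_t] for a_t in alpha]


def linear_attention_recurrent(q: Matrix, k: Matrix, v: Matrix, alpha: Matrix) -> Matrix:
    """Assumed recurrent form: S_t = diag(alpha_t) S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    d_k, d_v = len(q[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size memory S
    outputs = []
    for q_t, k_t, v_t, a_t in zip(q, k, v, alpha):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory (forget), then write the new token (store).
                state[i][j] = a_t[i] * state[i][j] + k_t[i] * v_t[j]
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def implied_attention_map(q: Matrix, k: Matrix, alpha: Matrix) -> Matrix:
    """Causal map A with A[t][s] = sum_i q_t[i] * (prod_{r=s+1..t} alpha_r[i]) * k_s[i]."""
    n, d_k = len(q), len(q[0])
    attn = [[0.0] * n for _ in range(n)]
    for t in range(n):
        decay = [1.0] * d_k  # empty product when s == t
        for s in range(t, -1, -1):
            attn[t][s] = sum(q[t][i] * decay[i] * k[s][i] for i in range(d_k))
            decay = [decay[i] * alpha[s][i] for i in range(d_k)]
    return attn


# Toy inputs (arbitrary values chosen for illustration only).
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
alpha = [[0.9, 0.5], [0.8, 0.5], [0.7, 0.5]]
k = tied_keys(alpha)

out_recurrent = linear_attention_recurrent(q, k, v, alpha)
A = implied_attention_map(q, k, alpha)
out_parallel = [
    [sum(A[t][s] * v[s][j] for s in range(len(v))) for j in range(len(v[0]))]
    for t in range(len(q))
]
```

Under these assumptions, `out_recurrent` and `out_parallel` agree up to floating-point error: the recurrent pass uses a fixed-size state and costs linear time in sequence length, while the explicit map `A` is quadratic and is useful only for inspecting what the linear model can represent. The decay `alpha` plays the role of the dynamic memory gate, and tying the key to it removes one set of projection parameters.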