Abstract: Various linear complexity models, such as Linear Transformer (LinFormer), State Space Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional softmax attention in Transformer structures. However, the optimal design of these linear models is still an open question. In this work, we attempt to answer this question by finding the best linear approximation to softmax attention from a theoretical perspective. We start by unifying existing linear complexity models as the linear attention form and then identify three conditions for the optimal linear attention design: 1) Dynamic memory ability; 2) Static approximation ability; 3) Least parameter approximation. We find that none of the current linear models meet all three conditions, resulting in suboptimal performance. Instead, we propose Meta Linear Attention (MetaLA) as a solution that satisfies these conditions. Our experiments on the Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and the Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models.
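To make the "linear attention form" mentioned in the abstract concrete, here is a minimal NumPy sketch of a generic gated linear attention. It is an illustration under stated assumptions, not the paper's exact formulation: the per-channel decay `alpha`, the function names, and the unnormalized output are all choices made for this example. It shows that a recurrent update over a fixed-size state gives the same result as an explicit sum over all past tokens.

```python
import numpy as np

def linear_attention_recurrent(q, k, v, alpha):
    """Recurrent form with a fixed-size state (assumed illustrative form).

    q, k, alpha: arrays of shape (T, d_k); v: array of shape (T, d_v).
    State update: S_t = diag(alpha_t) @ S_{t-1} + outer(k_t, v_t)
    Output:       o_t = q_t @ S_t
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # memory state, independent of T
    out = np.zeros((T, d_v))
    for t in range(T):
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def linear_attention_parallel(q, k, v, alpha):
    """Same computation written as an explicit sum over past tokens.

    The weight of token s for query t is
    sum_i q[t, i] * (prod_{r=s+1..t} alpha[r, i]) * k[s, i].
    """
    T, d_k = q.shape
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        decay = np.ones(d_k)          # accumulated decay from s+1 up to t
        for s in range(t, -1, -1):    # walk backward from s = t to s = 0
            score = np.sum(q[t] * decay * k[s])
            out[t] += score * v[s]
            decay = decay * alpha[s]
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))
alpha = rng.uniform(0.5, 1.0, size=(T, d_k))

rec = linear_attention_recurrent(q, k, v, alpha)
par = linear_attention_parallel(q, k, v, alpha)
print("max difference:", np.abs(rec - par).max())
```

With `alpha` fixed to all ones, the recurrence reduces to plain causal linear attention without normalization, `np.tril(q @ k.T) @ v`. A data-dependent `alpha` is one way to let the state forget, which is the intuition behind the dynamic memory condition.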

What problem does this paper attempt to address?

This paper addresses the problem that existing linear-complexity models (such as Linear Transformer, State Space Model, and Linear RNN) do not achieve an optimal design when replacing the softmax attention mechanism in the conventional Transformer structure. Specifically, although these models run in linear time, their performance still falls short.

#### Main problems:

1. **Limitations of existing linear models**:
   - Current linear-complexity models (such as LinFormer, SSM, LinRNN) reduce computational complexity, but none of them meets all three of the following conditions:
     - **Dynamic memory ability**: The ability to adaptively store important information and forget unimportant information when processing input sequences.
     - **Static approximation ability**: The ability to approximate any softmax attention map.
     - **Least parameter approximation**: Use as few parameters as possible while meeting the first two conditions.
2. **Design of optimal linear approximation**:
   - The paper proposes a unified linear attention form, abstracts existing linear models into this form, and defines the necessary conditions for achieving an "optimal linear approximation".
   - Based on these conditions, the authors propose the Meta Linear Attention (MetaLA) model, which is designed to satisfy all three.

#### Solutions:

- **MetaLA module**:
  - **Remove unnecessary Key matrices**: Theoretical analysis shows that the Key matrix is not necessary, so the model structure can be simplified.
  - **Self-augmentation mechanism**: Strengthen each token's attention to itself to avoid attention dilution.
  - **Short convolution**: Introduce a short convolution to enhance local interaction.
- **Experimental verification**:
  - Experiments on multiple tasks, including multi-query associative recall (MQAR), language modeling, long-sequence modeling (LRA), and image classification, support the effectiveness of MetaLA.

### Summary:

The paper addresses the suboptimal design of existing linear-complexity replacements for softmax attention by proposing MetaLA, which the reported experiments show to be more effective than existing linear models.
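The three MetaLA ingredients listed above can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: tying the key to the decay gate as `1 - alpha`, the sigmoid-gated self term with the hypothetical parameter `w_aug`, and the depthwise causal convolution are all assumptions chosen to show how such components could fit together in NumPy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def short_causal_conv(x, kernel):
    """Depthwise causal convolution over the sequence axis.

    x: (T, d); kernel: (K, d). Output at step t mixes x[t-K+1 .. t] only,
    so no information flows from future tokens.
    """
    T, d = x.shape
    K = kernel.shape[0]
    padded = np.vstack([np.zeros((K - 1, d)), x])   # left padding keeps causality
    return np.stack([np.sum(padded[t:t + K] * kernel, axis=0) for t in range(T)])

def key_free_linear_attention(q, v, alpha, w_aug=None):
    """Gated linear attention without a separate Key projection.

    Assumption for this sketch: the key is tied to the decay gate,
    key_t = 1 - alpha_t, so channels that forget more also write more.
    q, alpha: (T, d_k); v: (T, d_v); w_aug: optional (d_k,) vector
    (hypothetical name) used for an illustrative self-attention boost.
    """
    T, d_k = q.shape
    S = np.zeros((d_k, v.shape[1]))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        key_t = 1.0 - alpha[t]
        S = alpha[t][:, None] * S + np.outer(key_t, v[t])
        out[t] = q[t] @ S
        if w_aug is not None:
            # Illustrative self term: add a gated copy of the current value
            # so the token's own contribution is not diluted by the state.
            out[t] += sigmoid(q[t] @ (w_aug * key_t)) * v[t]
    return out

rng = np.random.default_rng(1)
T, d_k, d_v = 5, 3, 2
x = rng.normal(size=(T, d_k))
kernel = rng.normal(size=(3, d_k))
q = short_causal_conv(x, kernel)                    # local mixing before attention
v = rng.normal(size=(T, d_v))
alpha = sigmoid(rng.normal(size=(T, d_k)))          # decay gate in (0, 1)
out = key_free_linear_attention(q, v, alpha, w_aug=np.ones(d_k))
print(out.shape)
```

In this sketch, dropping the Key projection removes one set of parameters, which is the kind of saving the least parameter condition asks for. Whether the tie between key and decay matches the paper's exact construction should be checked against the paper itself.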

MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map

Breaking the Low-Rank Dilemma of Linear Attention

Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention

Bridging the Divide: Reconsidering Softmax and Linear Attention

Agent Attention: On the Integration of Softmax and Linear Attention

LoLCATs: On Low-Rank Linearizing of Large Language Models

SEA: Sparse Linear Attention with Estimated Attention Mask

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Linear Attention via Orthogonal Memory

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Cross-layer Attention Sharing for Large Language Models

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

Softmax-free Linear Transformers

A Cheap Linear Attention Mechanism with Fast Lookups and Fixed-Size Representations

MultiMax: Sparse and Multi-Modal Attention Learning

Adaptive Multi-Resolution Attention with Linear Complexity

Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences