Abstract: The quadratic complexity of the attention mechanism represents one of the biggest hurdles for processing long sequences using Transformers. Current methods, relying on sparse representations or stateful recurrence, sacrifice token-to-token interactions, which ultimately leads to compromises in performance. This paper introduces TaylorShift, a novel reformulation of the Taylor softmax that enables computing full token-to-token interactions in linear time and space. We analytically determine the crossover points where employing TaylorShift becomes more efficient than traditional attention, aligning closely with empirical measurements. Specifically, our findings demonstrate that TaylorShift enhances memory efficiency for sequences as short as 800 tokens and accelerates inference for inputs of approximately 1700 tokens and beyond. For shorter sequences, TaylorShift scales comparably with vanilla attention. Furthermore, a classification benchmark across five tasks involving long sequences reveals no degradation in accuracy when employing Transformers equipped with TaylorShift. For reproducibility, we provide access to our code at https://github.com/tobna/TaylorShift.

What problem does this paper attempt to address?

The main problem this paper addresses is the computational cost of the self-attention mechanism in Transformer models. Standard self-attention has time and space complexity that is quadratic in the input sequence length, \(O(N^2)\), which makes long sequences expensive to process. The paper proposes TaylorShift, which approximates the softmax function with a Taylor expansion and reduces the complexity of self-attention from quadratic to linear, \(O(N)\), so that long sequences can be processed more efficiently. The key content of the paper is summarized below.

1. **Problem Background**:
   - The Transformer model has achieved great success in fields such as natural language processing (NLP) and computer vision (CV).
   - However, the quadratic complexity \(O(N^2)\) of its self-attention mechanism limits the effective processing of long sequences.
   - Existing linear-complexity methods (such as sparse attention or stateful recurrence) reduce the complexity but sacrifice token-to-token interactions, which degrades performance.
2. **Proposed Method**:
   - The paper introduces TaylorShift. By approximating the softmax function with a Taylor expansion, it reduces the computational complexity from quadratic to linear in the sequence length while keeping full token-to-token interactions.
   - TaylorShift improves the efficiency of long-sequence processing and scales comparably with standard attention on shorter sequences.
3. **Technical Details**:
   - **Direct TaylorShift**: Apply the Taylor approximation of the softmax to the attention scores, compute the complete token-to-token interaction matrix, and then multiply it by the value matrix \(V\).
   - **Efficient TaylorShift**: Reorder the operations so that the terms of the Taylor expansion are distributed over the query matrix \(Q\) and the key matrix \(K\), and defer the normalization to the last step. This changes the complexity from \(O(N^2d)\) to \(O(Nd^3)\), which is linear in the sequence length \(N\) for a fixed head dimension \(d\).
   - **Normalization Scheme**: Introduce a new normalization scheme that keeps intermediate results in a numerically stable range, so that training does not diverge because of overflow.
4. **Experimental Verification**:
   - TaylorShift improves memory efficiency for sequences as short as about 800 tokens and accelerates inference for sequences of about 1,700 tokens and above.
   - In classification tasks on multiple datasets, Transformers using TaylorShift perform comparably to or better than the standard Transformer, especially on long sequences.
5. **Conclusion**:
   - TaylorShift addresses the computational complexity of Transformers on long sequences while maintaining the performance and expressiveness of the model.
   - The method has potential applications in scenarios such as processing long texts and high-resolution images.

With these changes, TaylorShift offers a more efficient way for Transformers to process long-sequence data.
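
The difference between the direct and the efficient variant can be illustrated with a minimal pure-Python sketch. It assumes a second-order Taylor approximation of the exponential, \(1 + x + x^2/2\), applied to the scores \(x = q_i \cdot k_j\), and it omits the paper's normalization scheme and any temperature scaling. The function names (`taylor_weight`, `direct_taylor_attention`, `efficient_taylor_attention`) are hypothetical and do not come from the authors' repository. The efficient version relies on the identity \((q \cdot k)^2 = \sum_{a,b} q_a q_b k_a k_b\), which lets the key and value sums be computed once and reused for every query.

```python
# Illustrative sketch only: names and structure are assumptions,
# not the authors' implementation. Inputs are lists of lists of floats.

def taylor_weight(x):
    """Second-order Taylor approximation of exp(x); positive for every real x."""
    return 1.0 + x + 0.5 * x * x


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def direct_taylor_attention(Q, K, V):
    """O(N^2 d): compute every token-to-token weight, then normalize per query."""
    out = []
    for q in Q:
        w = [taylor_weight(dot(q, k)) for k in K]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out


def efficient_taylor_attention(Q, K, V):
    """O(N d^3): summarize keys and values once, then apply each query."""
    d = len(Q[0])
    # Append a constant-1 column so the same pass also yields the normalizer.
    Vh = [list(v) + [1.0] for v in V]
    m = len(Vh[0])
    # Zeroth-, first- and second-order summaries; their size does not depend on N.
    S0 = [sum(v[c] for v in Vh) for c in range(m)]
    S1 = [[sum(k[a] * v[c] for k, v in zip(K, Vh)) for c in range(m)]
          for a in range(d)]
    S2 = [[[sum(k[a] * k[b] * v[c] for k, v in zip(K, Vh)) for c in range(m)]
           for b in range(d)] for a in range(d)]
    out = []
    for q in Q:
        y = []
        for c in range(m):
            lin = sum(q[a] * S1[a][c] for a in range(d))
            quad = sum(q[a] * q[b] * S2[a][b][c]
                       for a in range(d) for b in range(d))
            y.append(S0[c] + lin + 0.5 * quad)
        # The last entry is the sum of weights; normalization is deferred to here.
        out.append([yc / y[-1] for yc in y[:-1]])
    return out


if __name__ == "__main__":
    Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
    K = [[0.2, 0.1], [-0.3, 0.5], [0.4, -0.1]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    a = direct_taylor_attention(Q, K, V)
    b = efficient_taylor_attention(Q, K, V)
    # Largest elementwise difference between the two variants.
    print(max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Both functions compute the same output up to floating-point rounding. The direct version does work proportional to \(N^2 d\), while the efficient version does work proportional to \(N d^3\), so which one is cheaper depends on how the sequence length \(N\) compares with the head dimension \(d\).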

TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax

Agent Attention: On the Integration of Softmax and Linear Attention

Adaptive Multi-Resolution Attention with Linear Complexity

Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention

Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences

Latte: Latent Attention for Linear Time Transformers

EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention

Flowformer: Linearizing Transformers with Conservation Flows

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Agglomerative Attention

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Cottention: Linear Transformers With Cosine Attention

Exploring Attention Map Reuse for Efficient Transformer Neural Networks

Efficient Time Series Processing for Transformers and State-Space Models through Token Merging

SEA: Sparse Linear Attention with Estimated Attention Mask

When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

SOFT: Softmax-free Transformer with Linear Complexity

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Softmax-Free Linear Transformers