Abstract: Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by RFA, we propose Macformer, a Transformer architecture that employs random Maclaurin features (RMF) to approximate various dot-product kernels, thereby accelerating attention computations for long sequences. Macformer consists of Random Maclaurin Feature Attention (RMFA) and pre-post Scaling Batch Normalization (ppSBN). The former is an unbiased approximation of dot-product kernelized attention, and the latter is a two-stage regularization mechanism guaranteeing the approximation error of RMFA. We conducted toy experiments to demonstrate the efficiency of RMFA and ppSBN, and experiments on the Long Range Arena (LRA) benchmark to validate the acceleration and accuracy of Macformer with different dot-product kernels. The experimental results of Macformer are consistent with our theoretical analysis.

### What problem does this paper attempt to solve?

This paper addresses the computational cost of the Transformer model on long-sequence data. The self-attention mechanism of the Transformer has \(O(n^2)\) time complexity in the sequence length \(n\), so its cost grows rapidly on long sequences and becomes the bottleneck of the model. To this end, the authors propose **Macformer**, a Transformer architecture that uses Random Maclaurin Features (RMF) to approximate dot-product kernel functions. With RMF, Macformer accelerates the attention computation on long sequences while maintaining accuracy.

#### Main problems:

1. **Computational efficiency problem**: The self-attention mechanism of the traditional Transformer has \(O(n^2)\) time complexity, which makes long sequences expensive to process.
2. **Flexibility problem**: Softmax attention uses the exponential function as its similarity measure, but different tasks may call for different similarity functions. A more flexible way of choosing the similarity function is therefore needed.

#### Solutions:

- **Random Maclaurin Feature Attention (RMFA)**: Uses RMF to approximate various dot-product kernel functions, which linearizes the attention computation (Softmax attention being the exponential-kernel case) and improves computational efficiency.
- **Pre-post Scaling Batch Normalization (ppSBN)**: A two-stage regularization mechanism that constrains the input space of RMFA to \(\ell_2(0,1)\), which guarantees the unbiasedness of the approximation and improves stability.

With these two components, Macformer gains efficiency and flexibility on long-sequence data while maintaining good accuracy. The experiments on the Long Range Arena (LRA) benchmark validate the acceleration and accuracy of Macformer with different dot-product kernels.

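
The linearization behind RMFA can be written out as a short derivation. This is a sketch under assumed notation that is not taken from the paper: queries \(q_i\), keys \(k_j\), values \(v_j\), a dot-product kernel \(\mathcal{K}\), and a random feature map \(\phi : \mathbb{R}^d \to \mathbb{R}^D\) whose inner products are unbiased estimates of the kernel.

```latex
% Assumption: E[\phi(x)^\top \phi(y)] = \mathcal{K}(x, y) for inputs in the admissible domain.
\mathrm{Attn}(q_i, K, V)
  = \frac{\sum_{j=1}^{n} \mathcal{K}(q_i, k_j)\, v_j^{\top}}
         {\sum_{j=1}^{n} \mathcal{K}(q_i, k_j)}
  \approx
    \frac{\phi(q_i)^{\top} \sum_{j=1}^{n} \phi(k_j)\, v_j^{\top}}
         {\phi(q_i)^{\top} \sum_{j=1}^{n} \phi(k_j)}
```

The two sums over \(j\) do not depend on \(i\), so they can be computed once and reused for every query. The cost is then linear in \(n\) (roughly \(O(nDd)\)) instead of the \(O(n^2 d)\) needed to form all pairwise kernel values.
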
### Summary:

The main goal of this paper is to improve the self-attention mechanism of the Transformer by introducing RMF and ppSBN, thereby improving its computational efficiency and performance on long-sequence data.
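
To make the RMF idea concrete, here is a minimal NumPy sketch of random Maclaurin features for the exponential kernel and the linear-time attention built on them. It is not the authors' implementation: it assumes the standard random Maclaurin construction (degree sampled with probability \(2^{-(N+1)}\), Rademacher projections), it omits ppSBN entirely, and all names (`sample_rmf`, `rmf_map`, `rmf_attention`, `exp_coeff`) are hypothetical.

```python
import math
import numpy as np


def exp_coeff(n):
    """Maclaurin coefficient a_n of exp(t) = sum_n t^n / n! (assumed kernel)."""
    return 1.0 / math.factorial(n)


def sample_rmf(d, num_features, coeff, seed=0):
    """Draw the random parameters of a Maclaurin feature map (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Degree N_i with P[N_i = n] = 2^-(n+1); numpy's geometric starts at 1.
    degrees = rng.geometric(0.5, size=num_features) - 1
    max_deg = max(int(degrees.max()), 1)
    # Rademacher (+1/-1) projection vectors, one stack of max_deg per feature.
    W = rng.choice([-1.0, 1.0], size=(num_features, max_deg, d))
    # Importance weight sqrt(a_N * 2^(N+1)) makes the estimate unbiased.
    scale = np.sqrt(np.array([coeff(int(n)) * 2.0 ** (int(n) + 1) for n in degrees]))
    return degrees, W, scale


def rmf_map(X, params):
    """Map rows of X (n, d) to features (n, D) with E[phi(x) . phi(y)] = f(<x, y>)."""
    degrees, W, scale = params
    proj = np.einsum("nd,fkd->nfk", X, W)            # (n, D, max_deg)
    # Keep only the first N_i projections of feature i; the rest act as 1.
    mask = np.arange(W.shape[1])[None, :] < degrees[:, None]
    proj = np.where(mask[None], proj, 1.0)
    return scale * proj.prod(axis=2) / np.sqrt(len(degrees))


def rmf_attention(Q, K, V, params):
    """Kernelized attention in time linear in the number of keys."""
    phi_q, phi_k = rmf_map(Q, params), rmf_map(K, params)
    kv = phi_k.T @ V               # (D, d_v), shared by all queries
    z = phi_k.sum(axis=0)          # (D,), normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]
```

The features are not guaranteed to be positive, so the normalizer `phi_q @ z` is only well behaved when the inputs have small norm and enough features are used. That sensitivity to input scale is consistent with the summary's point that the RMFA input space needs to be constrained. Passing a different `coeff` function with non-negative Maclaurin coefficients would switch to another dot-product kernel.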

Macformer: Transformer with Random Maclaurin Feature Attention
