Macformer: Transformer with Random Maclaurin Feature Attention

Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang
2024-08-21
Abstract: Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by RFA, we propose Macformer, a Transformer architecture that employs random Maclaurin features (RMF) to approximate various dot-product kernels, thereby accelerating attention computations for long sequences. Macformer consists of Random Maclaurin Feature Attention (RMFA) and pre-post Scaling Batch Normalization (ppSBN). The former is an unbiased approximation of dot-product kernelized attention, and the latter is a two-stage regularization mechanism that controls the approximation error of RMFA. We conducted toy experiments to demonstrate the efficiency of RMFA and ppSBN, and experiments on the Long Range Arena (LRA) benchmark to validate the acceleration and accuracy of Macformer with different dot-product kernels. The experimental results for Macformer are consistent with our theoretical analysis.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the computational cost of the Transformer on long-sequence data. The self-attention mechanism has \(O(n^2)\) time complexity in the sequence length \(n\), so its cost grows quickly on long sequences and becomes the bottleneck of the model. To address this, the authors propose **Macformer**, a Transformer architecture that uses Random Maclaurin Features (RMF) to approximate dot-product kernel functions. With RMF, Macformer accelerates the attention computation while maintaining accuracy on long sequences.

#### Main problems:

1. **Computational efficiency problem**: The self-attention mechanism of the standard Transformer has \(O(n^2)\) time complexity, which makes long sequences expensive to process.
2. **Flexibility problem**: Softmax attention uses the exponential function as its similarity measure, but different tasks may call for different similarity functions. A more flexible way of choosing the similarity function is therefore needed.

#### Solutions:

- **Random Maclaurin Feature Attention (RMFA)**: Uses RMF to approximate various dot-product kernel functions, which linearizes the attention computation and improves computational efficiency.
- **Pre-post Scaling Batch Normalization (ppSBN)**: A two-stage regularization mechanism that constrains the input space of RMFA to \(\ell_2(0,1)\). This guarantees the unbiasedness of the approximation and improves stability.

With these changes, Macformer processes long-sequence data more efficiently and with a flexible choice of kernel, while maintaining good accuracy. Experiments on the Long Range Arena (LRA) benchmark validate the acceleration and accuracy of Macformer with different dot-product kernels.
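To make the RMFA idea above more concrete, here is a minimal pure-Python sketch of random Maclaurin features for a dot-product kernel \(K(x, y) = \sum_n a_n \langle x, y \rangle^n\) with \(a_n \ge 0\), followed by kernelized attention computed in the linear-time order. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sample_rmf_params`, `rmf_features`, `linear_attention`, `kernel_attention`) are hypothetical, the degree distribution \(P[N = n] = 2^{-(n+1)}\) is one common choice, the exponential kernel is used only as an example, and ppSBN is not modeled. The demo inputs have Euclidean norm below 1, which is how the \(\ell_2(0,1)\) constraint is read here (an assumption).

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sample_rmf_params(dim, num_features, rng):
    """Draw, for each feature, a degree N and N Rademacher (+1/-1) vectors.

    Assumed degree law: P[N = n] = 2 ** -(n + 1) for n = 0, 1, 2, ...
    """
    params = []
    for _ in range(num_features):
        degree = 0
        while rng.random() < 0.5:  # fair coin: continue with probability 1/2
            degree += 1
        params.append([[rng.choice((-1.0, 1.0)) for _ in range(dim)]
                       for _ in range(degree)])
    return params


def rmf_features(x, params, coeff):
    """Feature map phi(x); coeff(n) returns the Maclaurin coefficient a_n >= 0."""
    scale = 1.0 / math.sqrt(len(params))
    feats = []
    for ws in params:
        n = len(ws)
        # sqrt(a_n * 2^(n+1)) undoes the sampling probability of degree n.
        value = math.sqrt(coeff(n) * 2.0 ** (n + 1))
        for w in ws:
            value *= dot(w, x)
        feats.append(value * scale)
    return feats


def exp_coeff(n):
    """Maclaurin coefficients of exp(t): a_n = 1 / n!."""
    return 1.0 / math.factorial(n)


def linear_attention(queries, keys, values, params, coeff):
    """Kernelized attention via phi(Q) (phi(K)^T V): linear in the number of keys.

    Caveat: RMF features can be negative, so the normalizer is not guaranteed
    to be positive; this sketch does not guard against that.
    """
    phi_k = [rmf_features(k, params, coeff) for k in keys]
    num_feats, dv = len(params), len(values[0])
    # Key/value summaries, computed once and shared by every query.
    s = [[sum(pk[f] * v[c] for pk, v in zip(phi_k, values)) for c in range(dv)]
         for f in range(num_feats)]
    z = [sum(pk[f] for pk in phi_k) for f in range(num_feats)]
    out = []
    for q in queries:
        pq = rmf_features(q, params, coeff)
        denom = dot(pq, z)
        out.append([sum(pq[f] * s[f][c] for f in range(num_feats)) / denom
                    for c in range(dv)])
    return out


def kernel_attention(queries, keys, values, kernel):
    """Exact quadratic-time reference: weights are kernel(<q, k>), normalized."""
    out = []
    for q in queries:
        weights = [kernel(dot(q, k)) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                    for c in range(len(values[0]))])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    params = sample_rmf_params(dim=4, num_features=4000, rng=rng)
    x, y = [0.3, -0.2, 0.1, 0.3], [0.1, 0.4, -0.2, 0.2]  # norms below 1
    approx = dot(rmf_features(x, params, exp_coeff),
                 rmf_features(y, params, exp_coeff))
    print("exact exp(<x, y>):", math.exp(dot(x, y)))
    print("RMF estimate:     ", approx)
```

Swapping `exp_coeff` for the coefficients of another dot-product kernel with non-negative Maclaurin coefficients changes the similarity function without touching the attention code, which is the flexibility the summary refers to. The estimate is unbiased in expectation, but its variance depends on the number of features and on the input norms, so small inputs and a few thousand features are used in the demo.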
### Summary:

The main goal of this paper is to improve the self-attention mechanism of the Transformer by introducing RMF and ppSBN, thereby improving its computational efficiency and accuracy on long-sequence data.