TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax

Tobias Christian Nauen, Sebastian Palacio, Andreas Dengel
2024-07-17
Abstract: The quadratic complexity of the attention mechanism represents one of the biggest hurdles for processing long sequences using Transformers. Current methods, relying on sparse representations or stateful recurrence, sacrifice token-to-token interactions, which ultimately leads to compromises in performance. This paper introduces TaylorShift, a novel reformulation of the Taylor softmax that enables computing full token-to-token interactions in linear time and space. We analytically determine the crossover points where employing TaylorShift becomes more efficient than traditional attention, aligning closely with empirical measurements. Specifically, our findings demonstrate that TaylorShift enhances memory efficiency for sequences as short as 800 tokens and accelerates inference for inputs of approximately 1700 tokens and beyond. For shorter sequences, TaylorShift scales comparably with the vanilla attention. Furthermore, a classification benchmark across five tasks involving long sequences reveals no degradation in accuracy when employing Transformers equipped with TaylorShift. For reproducibility, we provide access to our code at https://github.com/tobna/TaylorShift.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is the computational cost of the self-attention mechanism in the Transformer model. Standard self-attention has time and space complexity that is quadratic in the input sequence length, \(O(N^2)\), which makes long sequences difficult and resource-intensive to process. The paper proposes TaylorShift, which approximates the softmax function with a Taylor expansion and thereby reduces the complexity of self-attention from quadratic to linear, \(O(N)\), in the sequence length, so that long sequences can be processed more efficiently. The key content of the paper is summarized below.

1. **Problem Background**:
   - The Transformer model has achieved great success in fields such as natural language processing (NLP) and computer vision (CV).
   - However, the quadratic complexity \(O(N^2)\) of its self-attention mechanism limits the effective processing of long sequences.
   - Existing linear-complexity methods (such as sparse attention and stateful recurrence) reduce the complexity but sacrifice token-to-token interactions, which degrades performance.
2. **Proposed Method**:
   - The paper introduces TaylorShift. By approximating the softmax function with a Taylor expansion, it reduces the computational complexity from quadratic to linear in the sequence length while keeping full token-to-token interactions.
   - TaylorShift improves the efficiency of long-sequence processing and scales comparably with the traditional attention mechanism on shorter sequences.
3. **Technical Details**:
   - **Direct TaylorShift**: Uses the Taylor expansion directly to approximate the softmax function, computes the complete token-to-token interaction matrix, and then multiplies it by the value matrix \(V\).
   - **Efficient TaylorShift**: Reorders the operations so that the Taylor expansion is applied to the query matrix \(Q\) and the key matrix \(K\) separately, and postpones the normalization to the final step. This changes the complexity from \(O(N^2 d)\) to \(O(N d^3)\), where \(d\) is the per-head feature dimension.
   - **Normalization Scheme**: Introduces a new normalization scheme so that numerical overflow in intermediate results does not prevent convergence.
4. **Experimental Verification**:
   - TaylorShift improves memory efficiency for sequences of about 800 tokens and longer, and accelerates inference for sequences of about 1,700 tokens and longer.
   - In classification tasks on multiple datasets, the Transformer using TaylorShift performs comparably to, or even better than, the standard Transformer, especially on long sequences.
5. **Conclusion**:
   - TaylorShift addresses the computational complexity of the Transformer on long sequences while maintaining the performance and expressiveness of the model.
   - The method has potential applications in scenarios such as processing long texts and high-resolution images.

Through these improvements, TaylorShift provides a more efficient way for Transformers to process long-sequence data.
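
The difference between the direct and efficient variants described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a second-order Taylor expansion \(1 + x + x^2/2\) in place of the exponential, handles a single head, and leaves out the paper's normalization scheme and any input scaling. The function names are hypothetical. The reordering relies on the identity \((q \cdot k)^2 = (q \otimes q) \cdot (k \otimes k)\), which lets the squared term be computed from \(d^2\)-dimensional features without ever forming the \(N \times N\) matrix.

```python
import numpy as np


def taylor_attention_direct(Q, K, V):
    """Direct variant: materializes the N x N score matrix, O(N^2 d)."""
    A = Q @ K.T                       # (N, N) token-to-token scores
    W = 1.0 + A + 0.5 * A**2          # 2nd-order Taylor expansion of exp; always positive
    return (W @ V) / W.sum(axis=1, keepdims=True)


def taylor_attention_efficient(Q, K, V):
    """Reordered variant: no N x N matrix is formed, O(N d^3)."""
    N, d = Q.shape
    # Prepend a column of ones to V so the same products also
    # accumulate the normalizer (the row sums of W).
    V1 = np.concatenate([np.ones((N, 1)), V], axis=1)       # (N, d + 1)

    # Row-wise outer products flattened to d^2 features, so that
    # QQ @ KK.T equals the element-wise square of Q @ K.T.
    QQ = np.einsum("ni,nj->nij", Q, Q).reshape(N, d * d)
    KK = np.einsum("ni,nj->nij", K, K).reshape(N, d * d)

    zeroth = V1.sum(axis=0, keepdims=True)                   # constant term, (1, d + 1)
    first = Q @ (K.T @ V1)                                   # linear term, (N, d + 1)
    second = 0.5 * (QQ @ (KK.T @ V1))                        # quadratic term, (N, d + 1)

    Y = zeroth + first + second
    return Y[:, 1:] / Y[:, :1]        # normalization postponed to the last step
```

Calling both functions on the same random `Q`, `K`, and `V` (for example, shape `(16, 4)`) should give outputs that agree up to floating-point error. The efficient variant is only worthwhile when \(N\) is large relative to \(d\), since its intermediate `KK.T @ V1` has \(d^2 (d + 1)\) entries regardless of sequence length.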