Abstract: Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ or $\mathcal{O}(n)$, while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into $\mathcal{O}(\log n)$ levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. The overall complexity of Fast Multipole Attention is $\mathcal{O}(n)$ or $\mathcal{O}(n \log n)$, depending on whether the queries are down-sampled or not. This multi-level divide-and-conquer strategy is inspired by fast summation methods from $n$-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks and compare our Fast Multipole Attention model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy. The Fast Multipole Attention mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.
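
The scaling claim in the abstract can be made concrete by counting how many key/value groups a single query touches when groups grow with distance. The sketch below is an illustration under assumptions, not the paper's exact partition: `count_groups`, `per_query_bound`, `total_interactions`, the near-field width `m`, and the rule that group size doubles with distance are all hypothetical choices made here. It covers only the case where queries are not down-sampled.

```python
import math


def count_groups(n: int, i: int, m: int) -> int:
    """Count the key/value groups seen by query i in a sequence of length n.

    Assumed scheme (hypothetical, for illustration only):
    - up to m tokens on each side of the query are kept at full resolution;
    - beyond that, tokens are covered by groups of size m, 2m, 4m, ...
      so that groups farther away are larger.
    """
    count = 1  # the query's own position
    for span in (i, n - 1 - i):  # number of tokens to the left / right
        near = min(span, m)
        count += near  # one entry per nearby token
        remaining = span - near
        size = m
        while remaining > 0:
            count += 1  # one entry per coarse group
            remaining -= size
            size *= 2  # the next group is twice as large
    return count


def per_query_bound(n: int, m: int) -> float:
    """Upper bound on count_groups: O(m + log(n / m)) entries per query."""
    return 2 * m + 1 + 2 * (math.log2(n / m + 1) + 1)


def total_interactions(n: int, m: int) -> int:
    """Total number of query-group pairs over the whole sequence."""
    return sum(count_groups(n, i, m) for i in range(n))


for n in (256, 1024, 4096):
    # Compare the dense n^2 pair count with the hierarchical count.
    print(n, n * n, total_interactions(n, 8))
```

Because each query sees a fixed number of nearby tokens plus a number of coarse groups that grows only logarithmically in $n$, the total over all $n$ queries grows like $n \log n$ rather than $n^2$ under this assumed scheme.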

What problem does this paper attempt to address?

The problem this paper addresses is that self-attention in the Transformer model has $O(n^2)$ time and space complexity in the sequence length $n$. On long sequences this consumes excessive computing resources and limits the use of Transformers in long-sequence tasks. To overcome this bottleneck, the authors propose the Fast Multipole Attention (FMA) mechanism.

### Specific description of the problem

1. **High computational complexity**:
   - The time complexity of self-attention is $O(n^2)$: for an input sequence of length $n$, the amount of computation grows quadratically as $n$ increases.
   - The space complexity is also $O(n^2)$: the full attention matrix must be stored, which can exhaust memory when processing long sequences.
2. **Limited applicability**:
   - Because of this complexity, the Transformer model is difficult to apply to tasks that require long sequences, such as long-text generation and document-level natural language processing.

### Solution

The authors propose a new attention mechanism, Fast Multipole Attention (FMA). Using a divide-and-conquer strategy, it reduces the time and space complexity to $O(n \log n)$ or $O(n)$ while retaining a global receptive field. The specific methods are as follows:

- **Hierarchical processing**: Queries, keys, and values are grouped and processed at different resolutions according to distance. Nearby tokens keep full resolution, while distant tokens are down-sampled or summarized, which reduces the amount of computation.
- **Multipole expansion**: Inspired by the Fast Multipole Method (FMM) from physics, the interaction between groups is computed with learned basis functions, giving an efficient hierarchical computation.

### Advantages

- **Efficiency improvement**: FMA can significantly reduce computational and memory overhead while maintaining high accuracy and a global receptive field.
- **Suitable for long sequences**: It is especially suited to long-sequence tasks such as long-text generation and document-level natural language processing.
- **Flexibility**: It can be used for both autoregressive and bidirectional tasks, as in GPT-style and BERT-style models.

Through these improvements, FMA enables the Transformer model to process long-sequence data more efficiently and broadens its range of practical applications.
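
The hierarchical processing idea summarized above can be sketched with just two levels of resolution. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: `two_level_attention` and `full_attention` are hypothetical names, distant blocks are summarized by fixed average pooling (the paper learns the down-sampling weights), only the bidirectional case is shown, and two levels do not reach the $O(n \log n)$ cost of the full multi-level scheme.

```python
import numpy as np


def full_attention(q, k, v):
    """Standard softmax attention, O(n^2) in the sequence length."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)  # stabilize the softmax
    w = np.exp(s)
    return (w / w.sum(axis=-1, keepdims=True)) @ v


def two_level_attention(q, k, v, block=4):
    """Fine attention to neighboring blocks, coarse attention to the rest.

    Assumption: far blocks are summarized by average pooling (not learned).
    """
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    nb = n // block
    # Coarse level: one averaged key and value per block.
    k_coarse = k.reshape(nb, block, d).mean(axis=1)
    v_coarse = v.reshape(nb, block, -1).mean(axis=1)
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, v.shape[-1]))
    for b in range(nb):
        rows = slice(b * block, (b + 1) * block)
        lo, hi = max(b - 1, 0), min(b + 2, nb)
        near = slice(lo * block, hi * block)  # own block and direct neighbors
        far = np.concatenate([np.arange(lo), np.arange(hi, nb)])
        s_near = q[rows] @ k[near].T * scale
        # Adding log(block) lets each summary stand in for `block` tokens.
        s_far = q[rows] @ k_coarse[far].T * scale + np.log(block)
        s = np.concatenate([s_near, s_far], axis=1)
        s = s - s.max(axis=1, keepdims=True)
        w = np.exp(s)
        w = w / w.sum(axis=1, keepdims=True)
        vals = np.concatenate([v[near], v_coarse[far]], axis=0)
        out[rows] = w @ vals
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
approx = two_level_attention(q, k, v, block=4)
exact = full_attention(q, k, v)
print("max abs deviation from full attention:", np.abs(approx - exact).max())
```

Each query still receives information from every position, so the receptive field stays global, but distant positions contribute through one summary per block instead of one entry per token. When all keys inside each block are identical, the coarse path reproduces full attention exactly; otherwise it is an approximation.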

Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences

H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Adaptive Multi-Resolution Attention with Linear Complexity

FAST: Factorizable Attention for Speeding up Transformers

Fastformer: Additive Attention Can Be All You Need

Agglomerative Attention

Long-range Sequence Modeling with Predictable Sparse Attention

Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

FMMformer: Efficient and Flexible Transformer Via Decomposed Near-field and Far-field Attention

Luna: Linear Unified Nested Attention

Efficient Long Sequence Modeling Via State Space Augmented Transformer

Transformer Acceleration with Dynamic Sparse Attention

Mega: Moving Average Equipped Gated Attention

Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention

Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention