Abstract: The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as agents for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast that information back to $Q$. Given that the number of agent tokens can be designed to be much smaller than the number of query tokens, agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention.
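The two-step computation described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative reading of the abstract, roughly $O = \mathrm{softmax}(QA^\top)\,\mathrm{softmax}(AK^\top)\,V$, and not the authors' implementation. The helper names, the $1/\sqrt{d}$ scaling, and the way the agent tokens are built here (averaging groups of consecutive queries) are all assumptions made for the example.

```python
import math
import random


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_attention(q, k, v):
    """Standard Softmax attention; the 1/sqrt(d) scaling is assumed here."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)


def agent_attention(q, a, k, v):
    """Sketch of the (Q, A, K, V) quadruple from the abstract."""
    # Step 1 (aggregation): agent tokens act as queries over K and V.
    # The score matrix is (num_agents x num_keys).
    agent_values = softmax_attention(a, k, v)
    # Step 2 (broadcast): queries attend to the agents, which now serve
    # as keys, with the aggregated features as values.
    # The score matrix is (num_queries x num_agents).
    return softmax_attention(q, a, agent_values)


random.seed(0)
n, m, d = 16, 4, 8  # hypothetical sizes: 16 tokens, 4 agents, width 8
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Assumed agent construction for illustration: average each group of
# n // m consecutive query tokens.
group = n // m
a = [[sum(col) / group for col in zip(*q[i * group:(i + 1) * group])]
     for i in range(m)]

out = agent_attention(q, a, k, v)
print(len(out), len(out[0]))  # 16 8
```

With `m` agents and `n` tokens, neither step forms an `n x n` score matrix. Both score matrices have `n * m` entries, so under these assumptions the cost grows linearly in `n` for a fixed `m`, which is consistent with the efficiency claim in the abstract.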

$k$NN Attention Demystified: A Theoretical Exploration for Scalable Transformers

Faster Nearest Neighbor Machine Translation

Attention as an RNN

KVT: k-NN Attention for Boosting Vision Transformers

A Primal-Dual Framework for Transformers and Neural Networks

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Transforming Recurrent Neural Networks with Attention and Fixed-point Equations

QKFormer: Hierarchical Spiking Transformer using Q-K Attention

Gated recurrent neural networks discover attention

Spiking Transformer with Spatial-Temporal Attention

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Improving Transformers with Probabilistic Attention Keys

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Dynamical Mean-Field Theory of Self-Attention Neural Networks

Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

Breaking the Attention Bottleneck

Spatio-Temporal Approximation: A Training-Free SNN Conversion for Transformers

Transformer Neural Processes -- Kernel Regression

Attention as a Hypernetwork

Centroid Transformers: Learning to Abstract with Attention

Agent Attention: On the Integration of Softmax and Linear Attention