Abstract: The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as the agent for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast the information back to $Q$. Given that the number of agent tokens can be designed to be much smaller than the number of query tokens, agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention.
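The two-step mechanism described in the abstract (agents aggregate from $K$ and $V$, then broadcast back to $Q$) can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the released implementation. The function names `softmax` and `agent_attention`, the tensor sizes, and the choice to build $A$ by average-pooling groups of queries are all assumptions made for the example; the abstract does not say how the agent tokens are obtained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, A, K, V):
    """Single-head agent attention sketch.

    Q: (N, d) queries, A: (n, d) agent tokens, K: (N, d) keys, V: (N, d_v) values.
    With n << N the cost is O(N * n * d) rather than O(N^2 * d).
    """
    scale = 1.0 / np.sqrt(Q.shape[-1])
    # Step 1 (aggregation): agents act as queries over K and V.
    agent_values = softmax(A @ K.T * scale) @ V       # (n, d_v)
    # Step 2 (broadcast): queries attend to the agents to read the result back.
    return softmax(Q @ A.T * scale) @ agent_values    # (N, d_v)

rng = np.random.default_rng(0)
N, n, d = 64, 8, 16                       # illustrative sizes (assumed)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
# Assumption for this sketch: agents are average-pooled groups of queries.
A = Q.reshape(n, N // n, d).mean(axis=1)  # (n, d)
out = agent_attention(Q, A, K, V)         # (N, d)
```

Because the two softmax maps are multiplied before touching $V$ only implicitly, the result equals `(softmax(Q A^T) softmax(A K^T)) V` computed in the cheaper order `softmax(Q A^T) (softmax(A K^T) V)`. This regrouping is what lets it be read as a generalized linear attention.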

Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers

Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer

When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Understanding Self-Attention of Self-Supervised Audio Transformers

Probing self-attention in self-supervised speech models for cross-linguistic differences

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Compressing Transformer-based self-supervised models for speech processing

Simplified Self-Attention for Transformer-based End-to-End Speech Recognition

Selective Attention Improves Transformer

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Exploring Self-Attention Mechanisms for Speech Separation

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Relaxed Attention for Transformer Models

On the Expressive Power of Self-Attention Matrices

Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement

Agent Attention: On the Integration of Softmax and Linear Attention

Improving Transformers with Dynamically Composable Multi-Head Attention

SAMSA: Efficient Transformer for Many Data Modalities

Attention Is Not All You Need Anymore