Abstract: The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as the agent for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast the information back to $Q$. Given that the number of agent tokens can be designed to be much smaller than the number of query tokens, agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention.
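The two-step computation described in the abstract (agent tokens aggregate from $K$ and $V$, then broadcast back to $Q$) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. The helper names (`softmax_rows`, `matmul`, `softmax_attention`, `agent_attention`), the $1/\sqrt{d}$ scaling, and the use of plain Python lists are assumptions made for illustration; the abstract does not specify how the agent tokens $A$ are produced, so they are taken here as a given input.

```python
import math

def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax_attention(q, k, v):
    """Standard Softmax attention; the 1/sqrt(d) scaling is an assumption."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)

def agent_attention(q, a, k, v):
    """Sketch of attention over the quadruple (Q, A, K, V).

    q: N x d query tokens, a: n x d agent tokens (n much smaller than N),
    k: N x d keys, v: N x d values.
    """
    # Step 1 (aggregation): agent tokens act as queries over K and V.
    agent_values = softmax_attention(a, k, v)      # n x d
    # Step 2 (broadcast): Q attends to the agents, which now serve as
    # keys, with the aggregated features as values.
    return softmax_attention(q, a, agent_values)   # N x d

# Toy usage: N = 6 tokens, n = 2 agent tokens, d = 4 channels.
q = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
k = [[0.05 * (i * j) for j in range(4)] for i in range(6)]
v = [[float(i) for _ in range(4)] for i in range(6)]
a = q[:2]  # hypothetical choice of agent tokens, for illustration only
out = agent_attention(q, a, k, v)
```

In this sketch each step forms a score matrix of size n x N or N x n rather than N x N, so the cost grows linearly in the number of tokens N when n is held fixed, which is consistent with the efficiency argument in the abstract.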

Convolution-enhanced Evolving Attention Networks

FAM: Improving columnar vision transformer with feature attention mechanism

GAttANet: Global attention agreement for convolutional neural networks

Demystify Transformers & Convolutions in Modern Image Deep Networks

An Attention Module for Convolutional Neural Networks

Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

CAT: Cross Attention in Vision Transformer

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

Predictive Attention Transformer: Improving Transformer with Attention Map Prediction

Attention Is All You Need

Agent Attention: On the Integration of Softmax and Linear Attention

Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

Adder Attention for Vision Transformer

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

X-volution: On the unification of convolution and self-attention

Attention as an RNN

Attentive Convolution: Equipping CNNs with RNN-style Attention Mechanisms

Dynamic Unary Convolution in Transformers

Entangled Transformer for Image Captioning

TransConvNet: Perform perceptually relevant driver's visual attention predictions

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition