Abstract: Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs already responsible for per-pixel feature extraction, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel \textbf{at different granularity levels} instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named \textit{GLMix}: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at \url{https://github.com/rayleizhu/GLMix}.
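
The grid/slot split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the clustering, dispatching, or fusion are computed, so the sketch assumes softmax-normalized dot-product assignments, a single attention head standing in for MHSA, a depthwise 3x3 convolution for the local branch, and fusion by addition. All function and parameter names (`soft_cluster`, `dispatch`, `glmix_like_block`, `slot_init`, and so on) are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_cluster(grid, slot_init):
    """Pool N pixel features into M slots (assumed dot-product assignment).

    grid: (N, C) flattened pixel features; slot_init: (M, C) initial slots.
    Returns the slots (M, C) and the raw assignment logits (M, N).
    """
    logits = slot_init @ grid.T / np.sqrt(grid.shape[1])
    assign = softmax(logits, axis=1)  # each slot's weights over pixels sum to 1
    return assign @ grid, logits


def dispatch(slots, logits):
    """Send slot features back to the grid, reusing the clustering logits."""
    weights = softmax(logits, axis=0)  # each pixel's weights over slots sum to 1
    return weights.T @ slots  # (N, C)


def self_attention(slots, wq, wk, wv):
    """Single-head self-attention over slots (a simplification of MHSA)."""
    q, k, v = slots @ wq, slots @ wk, slots @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)  # (M, M), not (N, N)
    return attn @ v


def depthwise_conv3x3(x, kernel):
    """Depthwise 3x3 convolution with zero padding. x: (H, W, C), kernel: (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out


def glmix_like_block(x, slot_init, kernel, wq, wk, wv):
    """Local branch (conv on grid) plus global branch (attention on slots)."""
    h, w, c = x.shape
    grid = x.reshape(h * w, c)
    slots, logits = soft_cluster(grid, slot_init)
    slots = self_attention(slots, wq, wk, wv)
    global_feat = dispatch(slots, logits).reshape(h, w, c)
    local_feat = depthwise_conv3x3(x, kernel)
    return local_feat + global_feat  # additive fusion is an assumption


# Usage with random stand-in weights: a 16x16 grid (256 pixels) and 64 slots.
rng = np.random.default_rng(0)
C, M = 32, 64
x = rng.standard_normal((16, 16, C))
out = glmix_like_block(
    x,
    slot_init=rng.standard_normal((M, C)),
    kernel=rng.standard_normal((3, 3, C)),
    wq=rng.standard_normal((C, C)),
    wk=rng.standard_normal((C, C)),
    wv=rng.standard_normal((C, C)),
)
```

In this sketch the attention matrix has shape (M, M), so its cost depends on the number of slots rather than on the number of pixels; only the clustering and dispatching steps, which are linear in the number of pixels for a fixed M, touch the full-resolution grid.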

Look and Think: Intrinsic Unification of Self-Attention and Convolution for Spatial-Channel Specificity

CSA-Net: Deep Cross-Complementary Self Attention and Modality-Specific Preservation for Saliency Detection

SCSA: Exploring the Synergistic Effects Between Spatial and Channel Attention

On the Integration of Self-Attention and Convolution

HAM: Hybrid Attention Module in Deep Convolutional Neural Networks for Image Classification

X-volution: On the unification of convolution and self-attention

CAT: Learning to Collaborate Channel and Spatial Attention from Multi-Information Fusion

Self-attentional Convolution for Neural Networks

Locally Enhanced Self-Attention: Combining Self-Attention and Convolution as Local and Context Terms

Revisiting the Integration of Convolution and Attention for Vision Backbone

SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning

Nonlocal Spatial Attention Module for Image Classification

CSA-Net: Channel-wise Spatially Autocorrelated Attention Networks

Efficient Multi-Scale Attention Module with Cross-Spatial Learning

Spatial Decomposition and Aggregation for Attention in Convolutional Neural Networks

Spatial Global Context Attention for Convolutional Neural Networks: an Efficient Method

Spatial Group and Cross-Channel Attention: Make Smaller Models More Effective, Focus on High-Level Semantic Features

Spatial Group-Wise Enhance: Enhancing Semantic Feature Learning in CNN

Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions

On the Relationship between Self-Attention and Convolutional Layers

MCA: Multidimensional Collaborative Attention in Deep Convolutional Neural Networks for Image Recognition