Abstract: Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.

What problem does this paper attempt to address?

Existing channel and spatial attention mechanisms enhance the representational ability of deep convolutional neural networks (CNNs), but they usually increase parameter count and computational cost. To address this, the authors propose STEAM (Squeeze and Transform Enhanced Attention Module), which aims to model both channel and spatial attention with minimal parameter and computational overhead. Specifically, the goals of the paper are:

1. **Reduce parameters and computation**: Compared with existing methods, STEAM lowers the parameter count and computational complexity by introducing a constant-parameter module.
2. **Improve model performance**: While keeping computational overhead low, improve performance on large-scale image classification, object detection, and instance segmentation.
3. **Comprehensively model channel and spatial attention**: Use the principles of relational modeling in graphs to handle channel and spatial attention together, capturing dependencies between features more effectively.

### Main contributions

- **Graph-based modeling of channel and spatial attention**: Channel graphs and spatial graphs are defined to enable efficient representation learning.
- **Multi-head attention mechanism**: Inspired by graph transformers, multi-head attention captures multiple relationships in the channel and spatial graphs.
- **Output Guided Pooling (OGP)**: A new technique for efficiently capturing spatial context and enhancing spatial attention modeling.
- **Constant-parameter module**: The module's parameter count is independent of the backbone network, so it can be integrated into various network architectures.
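To make the "constant-parameter" idea concrete, here is a minimal sketch of a squeeze-then-attend channel gate whose weights are shared across all channel nodes. It is not the authors' implementation: the class name `ChannelGraphGate`, the fully connected channel graph, the scalar node features, and the defaults `heads=2` and `head_dim=4` are all assumptions chosen for illustration, and the spatial branch and OGP are not modeled at all.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def squeeze(feature_map):
    """Global average pooling: C x H x W nested lists -> C scalars."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]


class ChannelGraphGate:
    """Hypothetical channel gate; NOT the STEAM reference implementation.

    Each channel is a graph node carrying one scalar (its pooled value).
    All weights are shared across nodes, so the parameter count depends
    only on heads * head_dim and never on the number of channels C.
    """

    def __init__(self, heads=2, head_dim=4, seed=0):
        rng = random.Random(seed)
        self.heads, self.head_dim = heads, head_dim
        dim = heads * head_dim
        # Four shared vectors: scalar -> query/key/value lifts, plus the
        # output projection back to one scalar per node.
        self.wq, self.wk, self.wv, self.wo = (
            [rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(4)
        )

    def num_parameters(self):
        return 4 * self.heads * self.head_dim

    def gates(self, feature_map):
        s = squeeze(feature_map)
        scale = math.sqrt(self.head_dim)
        logits = [0.0] * len(s)
        for h in range(self.heads):
            sl = slice(h * self.head_dim, (h + 1) * self.head_dim)
            q = [[si * w for w in self.wq[sl]] for si in s]
            k = [[si * w for w in self.wk[sl]] for si in s]
            v = [[si * w for w in self.wv[sl]] for si in s]
            for i, qi in enumerate(q):
                # Node i attends over every node (fully connected graph,
                # an assumption made here for simplicity).
                attn = softmax(
                    [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
                )
                mixed = [
                    sum(a * vj[d] for a, vj in zip(attn, v))
                    for d in range(self.head_dim)
                ]
                logits[i] += sum(m * w for m, w in zip(mixed, self.wo[sl]))
        # Sigmoid turns each node's logit into a gate in (0, 1).
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]

    def __call__(self, feature_map):
        g = self.gates(feature_map)
        return [
            [[g_c * x for x in row] for row in ch]
            for g_c, ch in zip(g, feature_map)
        ]
```

Because every weight vector is shared across nodes, `ChannelGraphGate().num_parameters()` returns the same value for a 3-channel input as for a 512-channel one. That is the property the summary calls "independent of the backbone network", shown here under the stated assumptions only.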
### Experimental results

The authors ran large-scale image classification experiments on ImageNet, and object detection and instance segmentation experiments on MS COCO. The results show that STEAM outperforms current state-of-the-art (SOTA) attention modules in accuracy while using fewer parameters and less computation. For example, integrating STEAM into ResNet-50 adds only 320 parameters and 3.57e-3 GFLOPs while raising Top-1 accuracy by 2%. STEAM therefore improves the performance of deep learning models without a significant increase in computational resources.
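To put the reported overhead in proportion, the short calculation below expresses the added parameters and GFLOPs as a percentage of a ResNet-50 baseline. The two baseline figures (about 25.56 million parameters and about 4.1 GFLOPs) are commonly quoted values and are assumptions here, not numbers from the source; only the 320 parameters and 3.57e-3 GFLOPs come from the summary.

```python
# Baseline figures for ResNet-50: assumptions, not taken from the source.
RESNET50_PARAMS = 25.56e6
RESNET50_GFLOPS = 4.1

# Overhead reported in the summary above.
STEAM_EXTRA_PARAMS = 320
STEAM_EXTRA_GFLOPS = 3.57e-3


def relative_overhead(extra, baseline):
    """Return the added cost as a percentage of the baseline."""
    return 100.0 * extra / baseline


param_pct = relative_overhead(STEAM_EXTRA_PARAMS, RESNET50_PARAMS)
gflops_pct = relative_overhead(STEAM_EXTRA_GFLOPS, RESNET50_GFLOPS)

print(f"parameter overhead: {param_pct:.5f}%")
print(f"GFLOPs overhead:    {gflops_pct:.4f}%")
```

Under these assumed baselines, both overheads come out below 0.1% of the backbone's cost.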

STEAM: Squeeze and Transform Enhanced Attention Module
