STEAM: Squeeze and Transform Enhanced Attention Module

Rishabh Sabharwal, Ram Samarth B B, Parikshit Singh Rathore, Punit Rathore
2024-12-12
Abstract: Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
Existing channel and spatial attention mechanisms enhance the representational ability of deep convolutional neural networks (CNNs), but they usually increase parameter count and computational cost. To address this, the authors propose STEAM (Squeeze and Transform Enhanced Attention Module), which aims to model channel and spatial attention comprehensively with minimal parameter and computational overhead. Specifically, the goals of the paper are:

1. **Reduce parameters and computation**: Compared with existing methods, STEAM lowers the parameter count and computational complexity by introducing a constant-parameter module.
2. **Improve model performance**: While keeping computational overhead low, improve model performance on large-scale image classification, object detection, and instance segmentation tasks.
3. **Comprehensively model channel and spatial attention**: Use the principles of relational modeling in graphs to handle channel and spatial attention together, capturing the dependencies between features more effectively.

### Main contributions

- **Graph-based modeling of channel and spatial attention**: Channel graphs and spatial graphs are defined to achieve efficient representation learning.
- **Multi-head attention mechanism**: Inspired by graph transformers, multi-head attention captures multiple relationships in the channel and spatial graphs.
- **Output Guided Pooling (OGP)**: A new technique for efficiently capturing spatial context and enhancing spatial attention modeling.
- **Constant-parameter module**: The module's parameter count is independent of the backbone network, so it can be integrated seamlessly into various network architectures.
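To make the idea of a constant-parameter, graph-style channel attention more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the ring-shaped neighbourhood, the single attention head, the scalar weights `wq`, `wk`, `wv`, and all function names are assumptions chosen for illustration. The sketch squeezes each channel to one scalar, lets every channel attend to itself and its two ring neighbours with scaled dot-product attention, and turns the result into a sigmoid gate. The point it illustrates is that the number of learnable weights (three here) does not grow with the number of channels.

```python
import math

def squeeze(feature_map):
    # Global average pooling: feature_map[c][h][w] -> one scalar per channel.
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]

def ring_attention_gates(desc, wq, wk, wv):
    # Hypothetical single-head attention over a ring graph of channel nodes.
    # Node features are scalars, so the usual 1/sqrt(d) scaling equals 1.
    n = len(desc)
    gates = []
    for i in range(n):
        # Each node attends to itself and its two neighbours on the ring.
        nbrs = sorted({(i - 1) % n, i, (i + 1) % n})
        scores = [(wq * desc[i]) * (wk * desc[j]) for j in nbrs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out = sum(e / z * wv * desc[j] for e, j in zip(exps, nbrs))
        gates.append(1.0 / (1.0 + math.exp(-out)))  # sigmoid gate in (0, 1)
    return gates

def apply_gates(feature_map, gates):
    # Rescale every channel by its attention gate.
    return [[[g * v for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]
```

The same three weights can be applied to a 4-channel or a 2048-channel feature map, which is the property the summary calls "constant-parameter"; a multi-head variant would repeat the computation with several weight triples and combine the outputs.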
### Experimental results

The authors conducted large-scale image classification experiments on the ImageNet dataset, and object detection and instance segmentation experiments on the MS COCO dataset. The results show that STEAM not only outperforms current state-of-the-art modules in accuracy but is also more efficient in parameters and computation. For example, when integrated into ResNet-50, STEAM adds only 320 parameters and 3.57e-3 GFLOPs while increasing Top-1 accuracy by 2%. STEAM thus offers an effective and efficient way to improve deep learning models without significantly increasing computational cost.
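For a sense of scale, the 320-parameter figure can be set against a channel-dependent attention block with some quick arithmetic. The sketch below rests on assumptions the summary does not state: one attention module per ResNet-50 bottleneck block (16 blocks in total), and a Squeeze-and-Excitation-style block with two bias-free fully connected layers and reduction ratio 16 as the point of comparison. The function names are hypothetical.

```python
# ResNet-50 bottleneck stages: (output channels, number of blocks).
RESNET50_STAGES = [(256, 3), (512, 4), (1024, 6), (2048, 3)]

def se_params(channels, reduction=16):
    # Assumed comparison block: two bias-free FC layers, C -> C/r -> C.
    return 2 * channels * (channels // reduction)

def total_se_params(stages=RESNET50_STAGES, reduction=16):
    # Parameter overhead grows with the square of the channel width.
    return sum(blocks * se_params(c, reduction) for c, blocks in stages)

def per_block_budget(total_added, stages=RESNET50_STAGES):
    # Assumes one module per bottleneck block (not stated in the summary).
    n_blocks = sum(blocks for _, blocks in stages)
    return total_added / n_blocks

print(total_se_params())      # 2514944
print(per_block_budget(320))  # 20.0
```

Under these assumptions, a channel-dependent block adds roughly 2.5 million parameters to ResNet-50, whereas a total of 320 parameters corresponds to about 20 per block regardless of channel width.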