Abstract: Efficient custom pooling techniques that aggressively trim the dimensions of a feature map, and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications, have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement for standard pooling layers, such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential soft-max. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With aggressive down-sampling of the activation maps in the initial layers (providing up to 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy than SOTA techniques with iso-memory footprints. Because the initial activation maps consume a significant amount of on-chip memory for the high-resolution images required by complex vision tasks, this enables the deployment of our models on memory-constrained devices, such as microcontrollers, without losing significant accuracy. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.
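The final stage named in the abstract (sigmoid activation followed by exponential soft-max) can be illustrated with a minimal sketch. It assumes a single-channel feature map and a per-position score map that would, in the proposed method, come from the patch-embedding and multi-head self-attention stages; here the scores are simply passed in. The function name `softmax_weighted_pool`, the window-wise normalization, and the non-overlapping stride are illustrative assumptions, not the paper's exact formulation.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax_weighted_pool(x, scores, k):
    """Down-sample a 2D map by a factor of k with score-dependent weights.

    x      -- feature map as a list of rows (H x W floats)
    scores -- raw importance scores with the same shape as x
              (assumed here to come from a non-local attention module)
    k      -- pooling window size and stride (non-overlapping windows)

    Each output value is a weighted average over one k x k window, with
    weights proportional to exp(sigmoid(score)), i.e. a soft-max over the
    sigmoid-activated scores inside that window.
    """
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - k + 1, k):
        row = []
        for j in range(0, w - k + 1, k):
            num = 0.0
            den = 0.0
            for di in range(k):
                for dj in range(k):
                    weight = math.exp(sigmoid(scores[i + di][j + dj]))
                    num += weight * x[i + di][j + dj]
                    den += weight
            row.append(num / den)
        out.append(row)
    return out


if __name__ == "__main__":
    feature_map = [[1.0, 2.0], [3.0, 4.0]]
    # Hypothetical scores that favor the bottom-right activation.
    attention_scores = [[-10.0, -10.0], [-10.0, 10.0]]
    print(softmax_weighted_pool(feature_map, attention_scores, 2))
```

When all scores in a window are equal, the weights cancel and the operation reduces to average pooling. Unequal scores pull the pooled value toward the activations the attention module rates as more important, which is where non-local context would enter in the full method.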

Delving deep into spatial pooling for squeeze-and-excitation networks

Squeeze Excitation Densely Connected Residual Convolutional Networks for Specific Emitter Identification Based on Measured Signals

Competitive Inner-Imaging Squeeze and Excitation for Residual Network

Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks

Self-Attentive Pooling for Efficient Deep Learning

Channel Locality Block: A Variant of Squeeze-and-Excitation

Spatially-Aware Context Neural Networks

Squeeze aggregated excitation network

DAR-Net: Dynamic Aggregation Network for Semantic Scene Segmentation

Gated Square-Root Pooling For Image Instance Retrieval

Spatial Moment Pooling Improves Neural Image Assessment

Strip Pooling: Rethinking Spatial Pooling for Scene Parsing

MS-SENet: Enhancing Speech Emotion Recognition Through Multi-Scale Feature Fusion With Squeeze-and-Excitation Blocks

Combining Local and Global: Rich and Robust Feature Pooling for Visual Recognition

Enhanced mechanisms of pooling and channel attention for deep learning feature maps

Optimizing motion detection performance: Harnessing the power of squeeze and excitation modules

Towards Efficient Scene Understanding Via Squeeze Reasoning

Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks

Generalized regular spatial pooling for image classification

Vortex Pooling: Improving Context Representation in Semantic Segmentation

A Lightweight Block with Information Flow Enhancement for Convolutional Neural Networks