Abstract: Efficient custom pooling techniques that aggressively trim the dimensions of a feature map, and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications, have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement for standard pooling layers, such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential softmax. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With aggressive down-sampling of the activation maps in the initial layers (providing up to 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy than SOTA techniques with iso-memory footprints. Because the initial activation maps consume a significant amount of on-chip memory for the high-resolution images required by complex vision tasks, this enables the deployment of our models on memory-constrained devices, such as micro-controllers, without losing significant accuracy. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.
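
The final stage named in the abstract (sigmoid activation followed by exponential softmax) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the attention scores (`logits`) are already available as a 2D map, whereas in the paper they would come from the patch-embedding and multi-head self-attention stages, which are omitted here. The function name `attention_weighted_pool` and the choice to apply the weights as a per-window weighted average are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function, maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attention_weighted_pool(x, logits, k):
    """Down-sample a 2D map by a factor k with softmax-weighted windows.

    x      : H x W list of lists (one activation channel).
    logits : H x W list of lists of attention scores (ASSUMED given;
             the abstract derives them from non-local self-attention).
    k      : pooling window size and stride; H and W must be divisible by k.
    """
    h, w = len(x), len(x[0])
    if h % k or w % k:
        raise ValueError("map height and width must be divisible by k")
    pooled = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            window = [(i + di, j + dj) for di in range(k) for dj in range(k)]
            # Sigmoid gate first, then exponentiate: the softmax numerator.
            exps = [math.exp(sigmoid(logits[r][c])) for r, c in window]
            total = sum(exps)
            # Softmax-normalised weighted average over the window.
            row.append(sum(x[r][c] * e for (r, c), e in zip(window, exps)) / total)
        pooled.append(row)
    return pooled
```

With uniform scores the sketch reduces to average pooling; larger scores pull the pooled value toward the corresponding activations. Because the sigmoid bounds each gate to (0, 1), no single position can outweigh another by more than a factor of e in this formulation.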

Generic Convolutional Neural Network with Random Pooling Area

Stochastic Area Pooling for Generic Convolutional Neural Network

Multi-scale Convolution Aggregation and Stochastic Feature Reuse for DenseNets

Deep Superpixel Convolutional Network for Image Recognition

A improved pooling method for convolutional neural networks

Balanced Mixture of SuperNets for Learning the CNN Pooling Architecture

Adaptive Salience Preserving Pooling for Deep Convolutional Neural Networks

Cascaded Subpatch Networks for Effective CNNs

Cross-convolutional-layer Pooling for Generic Visual Recognition

CSNN: an Augmented Spiking Based Framework with Perceptron-Inception

An Area-Efficient CNN Accelerator Supporting Global Average Pooling with Arbitrary Shapes

Wasserstein Pooling for Image Classification

Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree

Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization

Self-Attentive Pooling for Efficient Deep Learning

Stacked Pooling: Improving Crowd Counting by Boosting Scale Invariance

Gated Square-Root Pooling For Image Instance Retrieval

No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects

Attribute Aware Pooling for Pedestrian Attribute Recognition

Convolutional Neural Networks: A Comprehensive Evaluation and Benchmarking of Pooling Layer Variants

CSPNet: A New Backbone that can Enhance Learning Capability of CNN