Abstract: Successful visual recognition networks benefit from aggregating information across a wide range of scales. Previous research has investigated information fusion across connected layers or across multiple branches in a block, seeking to strengthen multi-scale representations. Despite their success, existing practices often allocate the neurons for each scale manually and keep the same ratio in all aggregation blocks of a network, which leads to suboptimal performance. In this paper, we propose to learn the neuron allocation for aggregating multi-scale information in the different building blocks of a deep network. The most informative output neurons in each block are preserved while the others are discarded, so that neurons for multiple scales are allocated competitively and adaptively. Our scale aggregation network (ScaleNet) is constructed by repeating a scale aggregation (SA) block that concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, convolution and upsampling operations. The data-driven neuron allocation and the SA block achieve strong representational power at considerably low computational cost. The proposed ScaleNet, obtained by replacing all 3x3 convolutions in ResNet with our SA blocks, achieves better performance than ResNet and strong variants such as ResNeXt and SE-ResNet at the same computational complexity. On ImageNet classification, ScaleNets reduce the top-1 error rate of ResNets by an absolute 1.12 (101 layers) and 1.82 (50 layers). On COCO object detection, ScaleNets improve the mmAP of Faster RCNN with ResNet backbones by an absolute 3.6 (101 layers) and 4.6 (50 layers). Code and models are released at https://github.com/Eli-YiLi/ScaleNet.
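
The competitive allocation described in the abstract can be illustrated with a small sketch. It is not the released implementation: the function name `allocate_neurons`, the scale labels, and the importance scores are hypothetical, and the abstract does not say how a neuron's informativeness is measured. The sketch only assumes that each candidate output neuron has some scalar importance score, and shows how keeping the top-scoring neurons across all scales yields a different neuron count per scale.

```python
from collections import Counter


def allocate_neurons(scores_per_scale, keep):
    """Keep the `keep` highest-scoring neurons across all scales.

    scores_per_scale maps a scale label to a list of importance scores,
    one per candidate output neuron at that scale (hypothetical values).
    Returns a dict mapping each scale to the number of neurons it retains.
    """
    # Pool neurons from every scale so that they compete in one ranking.
    candidates = [
        (score, scale)
        for scale, scores in scores_per_scale.items()
        for score in scores
    ]
    # Highest score first; sorted() is stable, so ties keep their input order.
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = Counter(scale for _, scale in ranked[:keep])
    return {scale: kept.get(scale, 0) for scale in scores_per_scale}


# Illustrative scores for one block with three scales, four candidates each.
scores = {
    "1x": [0.9, 0.1, 0.4, 0.05],
    "1/2x": [0.8, 0.7, 0.2, 0.03],
    "1/4x": [0.6, 0.02, 0.01, 0.15],
}
allocation = allocate_neurons(scores, keep=6)
print(allocation)
```

With these made-up scores the six retained neurons are split unevenly across the three scales, rather than by a fixed manual ratio. Running the same procedure separately for each block would let the ratio differ from block to block, which is the behaviour the abstract attributes to data-driven allocation.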

ScaleNet: An Unsupervised Representation Learning Method for Limited Information

Dynamic Convolution Covariance Network Using Multi-Scale Feature Fusion for Remote Sensing Scene Image Classification

RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network

ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation

Remote Sensing Scene Image Classification Model Based on Multi-Scale Features and Attention Mechanism

LPNet: A Remote Sensing Scene Classification Method Based on Large Kernel Convolution and Parameter Fusion

Unsupervised Representation Learning by Predicting Image Rotations

ScaleNAS: One-Shot Learning of Scale-Aware Representations for Visual Recognition

Data-Driven Neuron Allocation for Scale Aggregation Networks

TransformNet: Self-supervised representation learning through predicting geometric transformations

Scale-Net: Learning to Reduce Scale Differences for Large-Scale Invariant Image Matching

ScaleNet: Searching for the Model to Scale

Image Recognition Using Scale Recurrent Neural Networks

Scale Attention for Learning Deep Face Representation: A Study Against Visual Scale Variation

NeuralScale: Efficient Scaling of Neurons for Resource-Constrained Deep Neural Networks

Scale-wise Convolution for Image Restoration

AutoScaler: Scale-Attention Networks for Visual Correspondence

R-Net: A Relationship Network for Efficient and Accurate Scene Text Detection

Self-supervised Scale Equivariant Network for Weakly Supervised Semantic Segmentation

Scale-Aware Neural Network for Semantic Segmentation of Multi-Resolution Remote Sensing Images

Scene recognition with CNNs: objects, scales and dataset bias