MSEB: Plug and Play Multi-Scale Image Embedding Block for Vision Backbone

Hao Yuan, Bin Zhang, Yachuan Wang
DOI: https://doi.org/10.1016/j.neucom.2024.129040
IF: 6
2024-01-01
Neurocomputing
Abstract: Mainstream backbone networks in computer vision, built on sophisticated CNN or Transformer architectures, have achieved remarkable performance across various tasks. However, these works focus primarily on the 4× to 32× downsampling range of the input image, with limited exploration of the embedding portion of the network. In this paper, we propose a novel Multi-Scale Embedding Block (MSEB) to enhance embedding features. MSEB combines a Self-Calibrated Channel Dilate Unit (SCDU) and a Multi-Scale Encoder Unit (MSEU). SCDU projects images into a high-dimensional space and incorporates channel self-calibration to preserve fine-grained information in shallow feature maps, thereby providing a more comprehensive representation for subsequent network layers. MSEU employs a multi-scale cascade design with implicit feature sharing, expanding the range of receptive field scales to enhance the embedding representation. Extensive experiments on mainstream benchmarks demonstrate that the proposed multi-scale embedding block can be seamlessly integrated into most popular CNN and Transformer architectures, such as ResNet, Res2Net, Swin-Transformer, and MViT. Replacing the existing embedding structure in these methods yields significant improvements on ImageNet-1k: the CNN-based ResNet-50 and Res2Net-50, the latter of which additionally adopts a multi-scale design, achieve Top-1 accuracy improvements of 0.47% and 1.22%, respectively. The attention-based Swin-Tiny and MViT-Tiny, both of which adopt a multi-scale design, achieve Top-1 accuracy improvements of 2.57% and 1.05%, respectively.
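To make the scales mentioned in the abstract concrete, the short sketch below works through two pieces of arithmetic: the feature-map side length at each downsampling factor from 4× to 32×, and how the receptive field grows when stride-1 convolutions are cascaded. The 224×224 input size, the 3×3 kernel, the cascade depth, and both function names are assumptions chosen for illustration; none of them is taken from the paper, and this is not the authors' MSEU design.

```python
# Illustrative arithmetic only. The 224x224 input, the stride list, the 3x3
# kernel and the cascade depth are assumptions, not values from the paper.

def stage_resolutions(input_size, strides):
    """Map each cumulative downsampling factor to the feature-map side length."""
    return {s: input_size // s for s in strides}


def cascade_receptive_fields(kernel_size, depth):
    """Receptive field after 1..depth stacked stride-1 convolutions
    that all share the same kernel size."""
    return [1 + i * (kernel_size - 1) for i in range(1, depth + 1)]


# Side lengths at the 4x to 32x downsampling factors for a 224-pixel input.
resolutions = stage_resolutions(224, [4, 8, 16, 32])

# Receptive fields of one, two and three cascaded 3x3 convolutions.
fields = cascade_receptive_fields(3, 3)

print(resolutions)
print(fields)
```

Under these assumptions, each additional convolution in a cascade widens the receptive field by `kernel_size - 1` pixels, which is one generic way a cascaded block can expose several scales at once.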